Project on Featurization and Model Tuning:

Data Description

  • Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.
  • The actual concrete compressive strength (MPa) for a given mixture under a specific age (days) was determined from laboratory. Data is in raw form (not scaled). The data has 8 quantitative input variables, and 1 quantitative output variable, and 1030 instances (observations).

</font>

Attribute Information

  • All the features except age are ingredients used for making concrete and decides the strength.
  • All are numeric in nature.

ï‚· Cement : measured in kg in a m3 mixture

ï‚· Blast : measured in kg in a m3 mixture

ï‚· Fly ash : measured in kg in a m3 mixture

ï‚· Water : measured in kg in a m3 mixture

ï‚· Superplasticizer : measured in kg in a m3 mixture

ï‚· Coarse Aggregate : measured in kg in a m3 mixture

ï‚· Fine Aggregate : measured in kg in a m3 mixture

ï‚· Age : day (1~365)

ï‚· Concrete compressive strength measured in MPa

</font>

Objective

The aim of the dataset is to predict the concrete compressive strength of high performance concrete (HPC). HPC always does not mean high strength but covers all kinds of concrete for special application that are not possible with standard concretes. Therefore, our target value is 'Concrete Compressive Strength [MPa] or strength column in MPa'.

</font>

Steps, Tasks and Solutions

  1. Import the necessary libraries

  2. Read the data as a data frame and Preprocessing of the data

  3. Perform basic EDA

  1. Splitting of Dataset into training and testing & Scaling
  1. Iteration -1: With various Linear Models

  1. Iteration -2: Addressing Mix of Gaussian, Generation of Clusters, Addressing Outliers, Modeling Building

  1. Iteration -3: Composite Feature Creation, Generation of Cluster, Outlier Detection & Treatment, Model Building & Testing

  1. Iteration -4: Model Tuning using GridSearch CV and RandomSearch CV

  1. Conclusion

</font>

1. Import the necessary libraries

In [796]:
# importing the necessary package for performing advanced mathematical operation
import math

# importing the necessary package for managing data
import pandas as pd
import numpy as np

# importing the necessary packages for visualisation
import seaborn as sns
from matplotlib import pyplot
import matplotlib.pyplot as plt


sns.set (color_codes = True)
%matplotlib inline 

# commmand to tell Python to display my graphs in a darkgrid format
sns.set_style(style= 'darkgrid')


# for doing statistical calculation
import scipy
from sklearn import linear_model
import statsmodels.api as sm
from sklearn import metrics
from sklearn import datasets
import scipy.stats as stats
from scipy.stats import skew


# pre-processing method
from sklearn.model_selection import train_test_split
from scipy.stats import zscore

# Import Linear Regression machine learning library
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
import xgboost
from xgboost import XGBRegressor

# For creating Polynomial Features
from sklearn.preprocessing import PolynomialFeatures


# methods and classes for evaluation
from scipy.stats import randint as sp_randint
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_validate
from sklearn.utils import resample


import warnings
warnings.filterwarnings("ignore")

2. Read the data as a data frame

In [797]:
df = pd.read_csv('concrete (1).csv')
In [798]:
df.head()
Out[798]:
cement slag ash water superplastic coarseagg fineagg age strength
0 141.3 212.0 0.0 203.5 0.0 971.8 748.5 28 29.89
1 168.9 42.2 124.3 158.3 10.8 1080.8 796.2 14 23.51
2 250.0 0.0 95.7 187.4 5.5 956.9 861.2 28 29.22
3 266.0 114.0 0.0 228.0 0.0 932.0 670.0 28 45.85
4 154.8 183.4 0.0 193.3 9.1 1047.4 696.7 28 18.29

3. Performing basic Exploratory Data Analysis (EDA)

3.a. Description of the dataset (general operations and statistical description)

In [799]:
def indetailtable(df):
    print(f'Dataset Shape: {df.shape}')
    print('Total Number of rows in dataset= {}'.format(df.shape[0]))
    print('Total Number of columns in dataset= {}'.format(df.shape[1]))
    print('Various datatypes present in the dataset are: {}\n'.format(df.dtypes.value_counts()))
    
    summary = pd.DataFrame(df.dtypes, columns = ['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary['Name'] = summary[['Name', 'dtypes']]
    summary['Misssing_values'] = df.isnull().sum().values
    summary['Unique_values'] = df.nunique().values
    summary['Duplicate_values'] = df.duplicated().sum()
    summary['1st value'] = df.loc[0].values
    summary['2nd Value'] = df.loc[1].values
    summary['1028th Value'] = df.loc[1028].values
    summary['1029th Value'] = df.loc[1029].values
    
    return summary
In [800]:
brief = indetailtable(df)
brief
Dataset Shape: (1030, 9)
Total Number of rows in dataset= 1030
Total Number of columns in dataset= 9
Various datatypes present in the dataset are: float64    8
int64      1
dtype: int64

Out[800]:
index dtypes Name Misssing_values Unique_values Duplicate_values 1st value 2nd Value 1028th Value 1029th Value
0 cement float64 cement 0 278 25 141.30 168.90 342.00 540.00
1 slag float64 slag 0 185 25 212.00 42.20 38.00 0.00
2 ash float64 ash 0 156 25 0.00 124.30 0.00 0.00
3 water float64 water 0 195 25 203.50 158.30 228.00 173.00
4 superplastic float64 superplastic 0 111 25 0.00 10.80 0.00 0.00
5 coarseagg float64 coarseagg 0 284 25 971.80 1080.80 932.00 1125.00
6 fineagg float64 fineagg 0 302 25 748.50 796.20 670.00 613.00
7 age int64 age 0 14 25 28.00 14.00 270.00 7.00
8 strength float64 strength 0 845 25 29.89 23.51 55.06 52.61

Comments:

  • There are basically 1030 number of data points / observations and 08 number of columns/features available in the dataset.

  • All the columns are numeric in nature, they are either float or integer type. Here the the column 'strength' is the target column which predicts the concrete compressive strength of high performance concrete (HPC).

  • HPC does not mean high strength but covers all kinds of concrete for special applications that are not possible with standard concerts.

  • The rest of the columns are independent columns, and all are numeric in nature.

  • Here, we don't have missing values but certainly we have some duplicate columns.

  • A lot of unique values can be observed from each of the columns.

  • Further, we will explore more for duplicate columns and impute the proper treatment for avoiding any biasness by our models.

  • Since, the target column is a continuous column, we can expect the models to be linear models rather than classification models.

In [801]:
df.columns
Out[801]:
Index(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
       'fineagg', 'age', 'strength'],
      dtype='object')
In [802]:
for value in ['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
       'fineagg', 'age', 'strength']:
    print(value,':', sum(df[value] == '?'))
cement : 0
slag : 0
ash : 0
water : 0
superplastic : 0
coarseagg : 0
fineagg : 0
age : 0
strength : 0
In [803]:
df.isnull().values.any()
Out[803]:
False

  • Neither any 'Null values' nor we have any '?' symbols present in the data set, so we can consider the data set as a cleaned one and eligible for further analysis.
In [804]:
duplicate_rows_df = df[df.duplicated()]

print('Number of duplicated rows:', duplicate_rows_df.shape )
Number of duplicated rows: (25, 9)

Hence, there are 25 rows with duplicate elements present in each column, which can be seen as repeated in nature and can be treated as irrelevant. This problem could be due to things like data entry errors or data collection methods. Whatever the reason may be, duplication may lead to make incorrect conclusions by leading you to believe that some observations are more common than they really are.

In [805]:
duplicate_rows_df_cement = df[df.duplicated(['cement'])]
print(duplicate_rows_df_cement.shape)
(752, 9)
  • So there are cases where cement value is same, we can also check the unique values in the column
In [806]:
print(len(df.cement.unique()))
278

Hence it has total 278 unique values, by adding the unique and duplicated values for the column cement (278+752 = 1030) we have 1030 total number of rows. So it is necessary to retain the unique rows for better analysis, so at least we can drop the rows with same values in all column.

In [807]:
print('Shape of dataframe after dropping duplicates', df.drop_duplicates().shape)
Shape of dataframe after dropping duplicates (1005, 9)

Thus we have deleted total 25 number of duplicated rows. Now the data frame seems to be legit for further EDA.

In [808]:
df.describe()
Out[808]:
cement slag ash water superplastic coarseagg fineagg age strength
count 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000
mean 281.167864 73.895825 54.188350 181.567282 6.204660 972.918932 773.580485 45.662136 35.817961
std 104.506364 86.279342 63.997004 21.354219 5.973841 77.753954 80.175980 63.169912 16.705742
min 102.000000 0.000000 0.000000 121.800000 0.000000 801.000000 594.000000 1.000000 2.330000
25% 192.375000 0.000000 0.000000 164.900000 0.000000 932.000000 730.950000 7.000000 23.710000
50% 272.900000 22.000000 0.000000 185.000000 6.400000 968.000000 779.500000 28.000000 34.445000
75% 350.000000 142.950000 118.300000 192.000000 10.200000 1029.400000 824.000000 56.000000 46.135000
max 540.000000 359.400000 200.100000 247.000000 32.200000 1145.000000 992.600000 365.000000 82.600000

Comments:

In the above table 5 point summary have been described, i.e., minimum value, 25%, 50%, 75%, and maximum values of each column.

  • The statistical table above shows the distribution of each attribute by signifying the unique values of each column.

  • Some of the columns have zero values in it. It means these attribute constituent are not present in the manufacturing of the concrete. This might have happened intentionally to test the strength of the concrete by excluding the constituents.

  • The age column has a minimum value of 1 in it. This value signifies the bare minimum incubation period allowed for the concrete to create the atomic bonding between each constituent. Normally 28 days is the recommendation time to ensure the correct results. However this 1 day period or days below 28 gives less strength to the concrete.

  • Similarly, the distribution of strength column shows the minimum value of 2.33 MPa, means a week concrete with less compressive strength, this might have happened due to improper mixture ratio of constituents, or extremely more or extremely less quantity of water or may be due to very less incubation period of concrete.

  • Though, the columns have a single unit i.e. kg/m^3, except age and strength, they have a huge range of values or scale of values, which may bias the model performance.

  • So to avoid this kind of problem a proper scaling is required may be by MinMaxScalar or StandardScaler or simply by imputing the zscore.

  • Let's explore more about all the attributes individually....
In [809]:
# Cheking IQR
IQR = df.quantile(0.75) - df.quantile(0.25)
IQR
Out[809]:
cement          157.625
slag            142.950
ash             118.300
water            27.100
superplastic     10.200
coarseagg        97.400
fineagg          93.050
age              49.000
strength         22.425
dtype: float64
  • The IQR of all the attributes are calculated, and we can see the cement has highest Inter Quartile Range and strength has lowest Inter Quartile Range.
In [810]:
# Checking range
print( 'Range in Cement:', df.cement.max() - df.cement.min())
print('Range in Slag:',df.slag.max() - df.slag.min())
print('Range in Ash:',df.ash.max() - df.ash.min())
print('Range in water:',df.water.max() - df.water.min())
print('Range in Superplastic:',df.superplastic.max() - df.superplastic.min())
print('Range in Coarseagg:',df.coarseagg.max() - df.coarseagg.min())
print('Range in fineagg:',df.fineagg.max() - df.fineagg.min())
print('Range in Age:',df.age.max() - df.age.min())
print('Range in Strength:',df.strength.max() - df.strength.min())
Range in Cement: 438.0
Range in Slag: 359.4
Range in Ash: 200.1
Range in water: 125.2
Range in Superplastic: 32.2
Range in Coarseagg: 344.0
Range in fineagg: 398.6
Range in Age: 364
Range in Strength: 80.27
  • Here the range (the difference between maximum and minimum values) has been calculated for each attributes.
In [811]:
# Checking for the 1.5*IQR for presence of outliers
1.5*IQR
Out[811]:
cement          236.4375
slag            214.4250
ash             177.4500
water            40.6500
superplastic     15.3000
coarseagg       146.1000
fineagg         139.5750
age              73.5000
strength         33.6375
dtype: float64
  • To calculate the spread of outliers we will be taking the 1.5 times Inter Quartile Range value.
In [812]:
 (df.max()-df.min()) - 1.5*IQR 
Out[812]:
cement          201.5625
slag            144.9750
ash              22.6500
water            84.5500
superplastic     16.9000
coarseagg       197.9000
fineagg         259.0250
age             290.5000
strength         46.6325
dtype: float64
  • Thus by subtracting the 1.5*IQR from range we can get the spread of the outliers beyond the quartile range or whisker.

  • The above difference shows us the clear-cut presence of outliers beyond upper quartile range.

3.b. Univariate & Bivariate Analysis

In [813]:
# Checking the total number of unique values for target column
pd.value_counts(df['strength'])
Out[813]:
33.40    6
79.30    4
41.05    4
71.30    4
35.30    4
        ..
61.23    1
26.31    1
38.63    1
47.74    1
15.75    1
Name: strength, Length: 845, dtype: int64
In [814]:
plt.figure(figsize = (20,4))
plt.subplot(1,3,1)

plt.hist(df.strength , color = 'blue', edgecolor = 'black',  alpha = 0.7);
plt.xlabel('Strength of Concrete in MPa');

plt.subplot(1,3,2)
sns.violinplot(df['strength'], color = 'blue')
plt.xlabel('Strength of Concrete in MPa');

plt.subplot(1,3,3)
sns.boxplot(x = df.strength,palette = 'YlOrRd')
plt.xlabel('Strength of Concrete in MPa');

Observations:

  • The Strength column here is the target column, which is to be predicted for a concrete based upon the content of constituents in it. The strength here signifies the Compressive strength of an object. This compressive strength is measured in a Universal Testing machine by applying a uniform compressive force to the specimen and it is measured as Mega Pascal (MPa) generally in SI unit.
  • Compressive strength is the capacity of a material/structure to withstand loads tending to reduce size as opposed to tensile strength, which withstands loads tending to elongate.
  • This compressive strength of concrete determines the quality of concrete or gives us the grade of the concrete manufactured by using different ingredients within a given incubation period. This quality is highly influenced by the constituents used for producing it. These constituents are given as independent attributes in this dataset.
  • The strength is also dependent upon the number days we allow it to solidify in turn we allow it to create intermolecular bonds between each of the constituents. This period is known as incubation period and generally measures in terms of days.
  • A mould is prepared to bring the desired shape of the concrete, and the semi solid mixture is allowed to pour into the mould to attain the shape and generally based upon the requirement of strength, the incubation period is decided and then it is allowed to solidify.
  • An approximate normal distribution of the target column can be observed, however the long tail towards right in Violin plot shows the presence of outliers.
  • This leverage of outliers can be observed in Box-whisker plot, and the outliers are situated near by to the upper quartile range or maximum values.
  • Thus a proper treatment is necessary to avoid any kind of biasness during model building and predicting model performance.
In [815]:
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.cement , color = 'green', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Quantity of Cement in Kg per cubic meter of Concrete')

plt.subplot(2,2,2)
sns.violinplot(df['cement'], palette = 'gist_rainbow')
plt.xlabel('Quantity of Cement in Kg per cubic meter of Concrete')

plt.subplot(2,2,3)
sns.boxplot(x = df.cement, palette = 'copper')
plt.xlabel('Quantity of Cement in Kg per cubic meter of Concrete');

plt.subplot(2,2,4)
sns.scatterplot(df.cement, df.strength, color = 'green')
plt.title('Variation of strength with respect to Slag content', color = 'brown');

Observations:

  • Cement is one of the independent attribute, which plays a major role in building compressive strength of the concrete. By adding more cement, strength of the concrete increases.

  • However, if excessive amount of cement is added to the concrete then high heat of hydration will be generated, which in turn induces the thermal stresses in concrete by creating cracks.

  • The distribution of the data for this column is not normal. The column seems to be right skewed and we don't have outliers present in it.

In [816]:
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.slag , color = 'violet', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Quantity of slag in Kg per cubic meter of Concrete')

plt.subplot(2,2,2)
sns.violinplot(df['slag'], color = 'violet')
plt.xlabel('Quantity of slag in Kg per cubic meter of Concrete')

plt.subplot(2,2,3)
sns.boxplot(x = df.slag, color = 'teal')
plt.xlabel('Quantity of slag in Kg per cubic meter of Concrete');

plt.subplot(2,2,4)
sns.scatterplot(df.slag, df.strength,color = 'violet' )
plt.title('Variation of strength with respect to Slag content', color = 'brown');

Observations:

  • Slag is a by-product generated out from blast furnace. This constituent plays a major role in deciding the strength of concrete.

  • It acts like a binder and helps to increase the durability and strength of the concrete. However, the hardening process takes longer time to reach full compressive strength.

  • Initially the distribution seems to be normally distributed, however as the data point increases it becomes right skewed with a long tail representing the presence of outliers.

  • Multi modal distribution signifies the two clusters due to mix of Gaussians. This might have happened during data collection.

  • Box-whisker plot shows the presence of outliers beyond upper quartile range.

In [817]:
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.ash , color = 'orange', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Quantity of ash in Kg per cubic meter of Concrete')

plt.subplot(2,2,2)
sns.violinplot(df['ash'], color = 'orange')
plt.xlabel('Quantity of ash in Kg per cubic meter of Concrete')

plt.subplot(2,2,3)
sns.boxplot(x = df.ash, palette = 'gnuplot2')
plt.xlabel('Quantity of ash in Kg per cubic meter of Concrete');

plt.subplot(2,2,4)
sns.scatterplot(df.ash, df.strength, color = 'teal' );
plt.title('Variation of strength with respect to Fly ash content', color = 'brown');

Observations:

  • Fly ash is produced in small dark flakes by the burning of powdered coal.

  • It acts like a good binder like slag to improve the durability and strength of the concrete. However, the hardening process takes longer time to reach full compressive strength.

  • By addition of fly ash reduces the concrete bleeding and improves its workability. Fly ash can improve the long-term compressive strength of the conventional concrete.

  • The distribution of the column seems to be a complete mix of Gaussians. A gap or discontinuity between two Gaussian indicates the presence of two clusters in the column.

  • Though the distribution seems to be right tailed, the absence of outliers can be observed in this.

In [818]:
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.water , color = 'pink', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Quantity of water in Kg per cubic meter of Concrete')

plt.subplot(2,2,2)
sns.violinplot(df['water'], color = 'pink')
plt.xlabel('Quantity of water in Kg per cubic meter of Concrete')

plt.subplot(2,2,3)
sns.boxplot(x = df.water, palette = 'inferno')
plt.xlabel('Quantity of water in Kg per cubic meter of Concrete');

plt.subplot(2,2,4)
sns.scatterplot(df.water, df.strength,color = 'purple' );
plt.title('Variation of strength with respect to water content', color = 'brown');

Observations:

  • Water plays a role of binder between all the constituents to create a concrete.

  • An incorrect proportion of water in concrete may lead to less strength or cracks.

  • Too much of water or over-watered concrete may lead to lower strength, reduced durability, shrinkage cracking and a variety of surface problems. Thus, proper amount of water to be added to concrete with all other constituents to ensure the desired strength.

  • The Distribution of the column seems to be multimodal in nature with more than three clusters.

  • The long extended tails in the distribution to both the sides indicates the leverage of outliers in the attribute.

  • From the Box-Whiskers plot we can observe the presence of outliers on both the sides.

In [819]:
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.superplastic , color = 'purple', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Quantity of superplastic in Kg per cubic meter of Concrete')

plt.subplot(2,2,2)
sns.violinplot(df['superplastic'], color = 'purple')
plt.xlabel('Quantity of superplastic in Kg per cubic meter of Concrete')

plt.subplot(2,2,3)
sns.boxplot(x = df.superplastic, color = 'white')
plt.xlabel('Quantity of superplastic in Kg per cubic meter of Concrete');

plt.subplot(2,2,4)
sns.scatterplot(df.superplastic, df.strength,color = 'violet');
plt.title('Variation of strength with respect to Superplasticizer content', color = 'brown');

Observations:

  • The Superplasticizer is used to ensure better flow properties because they minimize particle segregation. Further, they allow to decrease the water-cement ratio which leads to higher compressive strength.

  • In the case of hardened concrete the use of Superplasticizer will increase compressive strength by enhancing the effectiveness of compaction to produce denser concrete.

  • The distribution of the column clearly indicates the mix of Gaussian. The distribution curve seems to be bimodal in nature with uneven distribution.

  • The right skewed data with right long tail in violin plot shows indicates the presence of outliers in right side.

  • The box-whisker plot shows the presence of outliers in right side at a very high distance from upper quartile.

In [820]:
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.coarseagg , color = 'red', edgecolor = 'black', alpha = 0.7);
plt.xlabel('Quantity of coarseagg in Kg per cubic meter of Concrete')

plt.subplot(2,2,2)
sns.violinplot(df['coarseagg'], color = 'red')
plt.xlabel('Quantity of coarseagg in Kg per cubic meter of Concrete')

plt.subplot(2,2,3)
sns.boxplot(x = df.coarseagg, color = 'orange')
plt.xlabel('Quantity of coarseagg in Kg per cubic meter of Concrete');

plt.subplot(2,2,4)
sns.scatterplot(df.coarseagg, df.strength, color = 'orange');
plt.title('Variation of strength with respect to coarse aggregate content', color = 'brown');

Observations:

  • Coarseagg stands for Coarse Aggregate column. It generally consists of sand, Grave and crushed stone.

  • The larger percentage of coarse aggregate in concrete mix makes it to contribute a lot to the strength of concrete. However, the tensile strength of the concrete gets severely affected by increasing the size of coarse aggregate.

  • In this column, the distribution seems to be multimodal and it can be due to mix of Gaussian once again and may be due to uneven distribution of data points.

  • This column doesn't have any outliers so it is free from baseness arises out of outliers.

In [821]:
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.fineagg , color = 'white', edgecolor = 'red', alpha = 0.7);
plt.xlabel('Quantity of fineagg in Kg per cubic meter of Concrete')

plt.subplot(2,2,2)
sns.violinplot(df['fineagg'], color = 'white',edgecolor = 'red')
plt.xlabel('Quantity of fineagg in Kg per cubic meter of Concrete')

plt.subplot(2,2,3)
sns.boxplot(x = df.fineagg, palette = 'gist_rainbow')
plt.xlabel('Quantity of fineagg in Kg per cubic meter of Concrete');

plt.subplot(2,2,4)
sns.scatterplot(df.fineagg, df.strength, color = 'red');
plt.title('Variation of strength with respect to fine aggregate content', color = 'brown');

Observations:

  • fineagg stands for Fine Aggregate, it consist of small size of coarse aggregate with fine crushed stones. It mostly contains the sand.

  • This increases the flexural strength of the concrete. However the workability of concrete decreases as fine content increases.

  • It helps in binding cement properly and thus increases the strength of the concrete. Beyond certain level of addition, it decreases the strength of the concrete.

  • The distribution of the column is nearly normal distribution. However, the right tailed data points indicate the presence of outliers.

  • The same can also be inferred from the Box-Whisker plot.

In [822]:
plt.figure(figsize = (20,12))
plt.subplot(2,2,1)
plt.hist(df.age , color = 'brown', edgecolor = 'green', alpha = 0.7);
plt.xlabel('Incubation period in number of days')

plt.subplot(2,2,2)
sns.violinplot(df['age'], color = 'brown',edgecolor = 'green', alpha = 0.7)
plt.xlabel('Incubation period in number of days')

plt.subplot(2,2,3)
sns.boxplot(x = df.age,palette = 'gist_stern')
plt.xlabel('Incubation period in number of days');

plt.subplot(2,2,4)
sns.scatterplot(df.age, df.strength,color = 'brown');
plt.title('Variation of strength with respect to incubation period', color = 'brown');

Observations:

  • Age is a deciding factor for increase in strength. The compressive strength of the concrete increases with age. However, this increase in strength gets stagnated after one year.

  • As per the industry standard, 28 days period is the basic period to attain a level of compressive strength.

  • In the histogram plot most of the data points are located between 0 to 100 days, and 0 to 25 days period has maximum count.

  • Rarely some of the concrete were allowed to solidify and gain strength till a period of 350 days. The presence of multi Gaussian and multi modal indicates 3 to 4 clusters. And the violin plot clearly shows the skewness of data points towards right. The very long tail towards right indicates the presence of outliers in the column.

  • The same can be noticed from box-whisker plot, and the outliers are located far away from the upper quartile range.

In [823]:
fig, ax = plt.subplots(figsize=(10,7))
sns.scatterplot(y="strength", x="cement", hue="water", size="age", data=df, ax=ax, sizes=(50, 300))
ax.set_title("strength vs (cement, age, water)")
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()

Observations:

  • It can be seen that, as the quantity of cement increases the strength also increases, and this is possible at a certain ratio of cement to water and at a certain age period.

  • Some points show that, at a less proportion of cement also concrete has a good strength and it is due to the proper quantity of water in it and a good time period to strengthen the concrete.

In [824]:
fig, ax = plt.subplots(figsize=(10,7))
sns.scatterplot(y="strength", x="fineagg", hue="ash", size="superplastic", data=df, ax=ax, sizes=(50, 300))
ax.set_title("strength vs (fineagg, superplastic, ash)")
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()

Observations:

  • A linear relationship cannot be expected between fine aggregate and strength column. At extreme proportion of fine aggregate, ash quantity seems to be very less with minimum superplastic and a moderate value of strength.

  • An optimal quantity of fine aggregate with equal amount of ash and super plastic give a good strength concrete.

In [825]:
fig, ax = plt.subplots(figsize=(10,7))
sns.scatterplot(y="strength", x="superplastic", hue="water", size="fineagg", data=df, ax=ax, sizes=(50, 300))
ax.set_title("strength vs (superplastic,fineagg , water)")
ax.legend(loc="upper left", bbox_to_anchor=(1,1))
plt.show()

Observations:

  • Superplastic of around 5 to 25 with a quantity of 150-180 for water and 650 quantity value of fine aggregate mixture gives a good strength of concrete.

  • Strength of concrete highly depends on the quantity of water present in it along with a fixed amount of fine aggregate. Dropping of superplastic completely from mixture has no effect on strength determination.

In [826]:
# Preparing a pandas dataframe to store the skewness of each column.
Skewness = pd.DataFrame({'Skewness': [stats.skew(df.cement), stats.skew(df.slag), 
                                      stats.skew(df.ash), stats.skew(df.water), 
                                      stats.skew(df.superplastic), 
                                      stats.skew(df.coarseagg), stats.skew(df.fineagg),
                                      stats.skew(df.age), stats.skew(df.strength)]}, 
                        index = ['cement', 'slag', 'ash', 'water', 'superplastic',
                                 'coarseagg','fineagg', 'age', 'strength'])
Skewness
Out[826]:
Skewness
cement 0.508739
slag 0.799550
ash 0.536571
water 0.074520
superplastic 0.905881
coarseagg -0.040161
fineagg -0.252641
age 3.264415
strength 0.416370

Observations:

  • Feature 'age' is highly right skewed due to presence of outliers, it is the time required to induce the strength into concrete, so some of the concretes have been kept for long days and allowed to strengthen by creating data points far away from IQR range.

  • Features like 'superplastic', 'Slag', 'ash' and 'cement' are the ingredients which determine the strength of the concrete. By adding these in correct proportion enhances the strength. The skewness in these columns may be due to incorrect proportion of mixture. In some cases the proportion may be due to presence of too much higher or lower quantity of ingredients with respect others, thus, by exceeding the IQR range.

In [827]:
# Correlation of entire dataframe
corr_matrix = df.corr()
# Features more related to Strength of Concrete
corr_matrix['strength'].sort_values(ascending = False)
Out[827]:
strength        1.000000
cement          0.497832
superplastic    0.366079
age             0.328873
slag            0.134829
ash            -0.105755
coarseagg      -0.164935
fineagg        -0.167241
water          -0.289633
Name: strength, dtype: float64

Observations:

  • Out of all features, cement has highest correlation value with target column followed by superplastic, age and slag.

  • Though, the correlation is not linear or value nearly equal to 1 or -1, some of the independent features influence the strength column.

3.c. Multivariate analysis

In [828]:
sns.pairplot(df, diag_kind = 'kde',  markers="^",palette = 'git_ow',
                      plot_kws=dict(s=50, edgecolor="red", linewidth=0.5),
                      diag_kws=dict(shade=True));

Observations:

  • The pair plot panel gives us some important visual observations among independent columns and target column.

  • From the KDEs i.e. diagonal parts it can be observed that, there may be 2 to 7 clusters available in the entire dataset.

  • Some of the columns are having a data cloud but the spread seems to be higher.

  • The collection of data except age column, gives almost a cloud with each other, however a linear vertical line can also be seen, which clearly indicates the accumulation of data points at a single value or record.

  • The age column which ranges from 1 to 365, gives us almost 5 clusters, with 5 linear lines stating the count of days for each five records.

  • None of the columns are highly correlated, means none of them are influencing each other in any manner. So absence of multicollinearity can be felt among them.

  • This is a good sign of relationship among independent attributes, and can faintly affect the model performance.

  • Due to the presence of long tails, the leverage of outliers can be felt, and proper feature engineering is required to gain model performance.

  • The relationship of all the independent attributes is nonlinear with target column.

  • Except the cement column, all are having a poor correlation with target or strength column.

  • Though the age column has a linear relationship with target column at different clusters, the slope seems to be equal to zero.

  • Since, most of the data points are cluttered in scatter cloud for a mathematical space between target column and independent columns, a nonlinear relationship can be inferred among them signifying a poor correlation between target and independent attributes.

  • Thus the columns having poor correlation values with target column can be merged and some composite attributes can be generated.

In [829]:
df.corr().T
Out[829]:
cement slag ash water superplastic coarseagg fineagg age strength
cement 1.000000 -0.275216 -0.397467 -0.081587 0.092386 -0.109349 -0.222718 0.081946 0.497832
slag -0.275216 1.000000 -0.323580 0.107252 0.043270 -0.283999 -0.281603 -0.044246 0.134829
ash -0.397467 -0.323580 1.000000 -0.256984 0.377503 -0.009961 0.079108 -0.154371 -0.105755
water -0.081587 0.107252 -0.256984 1.000000 -0.657533 -0.182294 -0.450661 0.277618 -0.289633
superplastic 0.092386 0.043270 0.377503 -0.657533 1.000000 -0.265999 0.222691 -0.192700 0.366079
coarseagg -0.109349 -0.283999 -0.009961 -0.182294 -0.265999 1.000000 -0.178481 -0.003016 -0.164935
fineagg -0.222718 -0.281603 0.079108 -0.450661 0.222691 -0.178481 1.000000 -0.156095 -0.167241
age 0.081946 -0.044246 -0.154371 0.277618 -0.192700 -0.003016 -0.156095 1.000000 0.328873
strength 0.497832 0.134829 -0.105755 -0.289633 0.366079 -0.164935 -0.167241 0.328873 1.000000
In [830]:
colormap = plt.cm.viridis # Color range to be used in heatmap
plt.figure(figsize = (15,12))
plt.title('Pearson Correlation of attributes', y=1, size = 20)
sns.heatmap(df.corr(), linewidth = 0.2, vmax = 1.0,
           square = True,  cmap = colormap,linecolor = 'red', annot = True); 

Observations:

  • None of the columns are highly correlated to each other.

  • The highest correlation among columns can be observed between superplastic and water, which is -0.66 means negatively correlated.

  • The target column has a good correlation with cement column, however, none of the other columns has a good linear relationship with target column.

  • The correlation value of some of the independent attributes with target is nearly equal, so merging of those columns can also be thought off.

  • So, here we do not need to drop any columns based upon the correlation values rather we can create some composite attributes by merging the columns with very less or nearly equal correlation value with target column.

3.d. Comments

  • In this part we have cleaned the data by imputing various data cleaning mechanisms.

  • The Exploratory Data Analysis has been carried out with the help of statistical description, Univariate, Bivariate & Multivariate plots, Correlation matrix, while checking for the IQR range, skewness of each attribute, correlation of independent attributes, mean, mode and median with standard deviation were also been calculated for each attribute.

  • Box plots, Histogram Plots, Violin Plots for density curves and scatter plot between each independent variable with respect to target column were also been reflected.

  • Outliers were also been detected with the help of Box plot for various attributes.

  • Mix of Gaussian has also been checked from pair plot. The probable clusters will be made in subsequent iteration part by using Unsupervised Learning process.

  • Presence of Outliers will also be addressed in subsequent iteration parts to enhance the model performance.

4. Splitting of Dataset into training and testing & Scaling

In [831]:
df_scaled  = df.apply(zscore)
In [832]:
# converting the numpy array back into a dataframe

df_scaled = pd.DataFrame(df_scaled, columns = df.columns )
In [833]:
df_scaled.head()
Out[833]:
cement slag ash water superplastic coarseagg fineagg age strength
0 -1.339017 1.601441 -0.847144 1.027590 -1.039143 -0.014398 -0.312970 -0.279733 -0.355018
1 -1.074790 -0.367541 1.096078 -1.090116 0.769617 1.388141 0.282260 -0.501465 -0.737108
2 -0.298384 -0.856888 0.648965 0.273274 -0.118015 -0.206121 1.093371 -0.279733 -0.395144
3 -0.145209 0.465044 -0.847144 2.175461 -1.039143 -0.526517 -1.292542 -0.279733 0.600806
4 -1.209776 1.269798 -0.847144 0.549700 0.484905 0.958372 -0.959363 -0.279733 -1.049727
In [834]:
X_scaled = df_scaled.drop('strength', axis=1)
y_scaled = df_scaled[['strength']]
In [835]:
# Split X and y into training and test set in 70:30 ratio

X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(X_scaled, y_scaled, test_size=0.30, random_state=1)
In [836]:
# checking the split of data
print('{0:0.2f}% data is in training set'.format((len(X_train_scaled)/len(df.index))*100))
print('{0:0.2f}% data is in testing set'.format((len(X_test_scaled)/len(df.index))*100))
70.00% data is in training set
30.00% data is in testing set

5. Iteration -1: With various Linear Models

5.a. Train and test using various Linear Algorithms

Linear Regression Model:

In [837]:
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train_scaled)
Out[837]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [838]:
# Let us explore the coefficients for each of the independent attributes

for idx, col_name in enumerate(X_train_scaled.columns):
    print("The coefficient for {} is {}".format(col_name, lr_model.coef_[0][idx]))
The coefficient for cement is 0.7917080306678761
The coefficient for slag is 0.5620608749496463
The coefficient for ash is 0.347476771026321
The coefficient for water is -0.1318613837251486
The coefficient for superplastic is 0.1344987684064457
The coefficient for coarseagg is 0.1118368924153759
The coefficient for fineagg is 0.1737008031228733
The coefficient for age is 0.41759097790251387
In [839]:
intercept = lr_model.intercept_[0]

print("The intercept for our model is {}".format(intercept))
The intercept for our model is -0.01304140022160526
In [840]:
# Model score - R2 or coeff of determinant
# R^2=1–RSS / TSS


# Model score - R2 or coeff of determinant
# R^2=1–RSS / TSS =  RegErr / TSS

lr_model_score_train = lr_model.score(X_train_scaled, y_train_scaled)

print('Training model Accuracy value: {0:0.2f}%'.format(lr_model_score_train*100))

lr_model_score_test = lr_model.score(X_test_scaled, y_test_scaled)
print('Testing Model Accuracy value: {0:0.2f}%'.format(lr_model_score_test*100))

# lr_model.score(X_test_scaled, y_test_scaled)
Training model Accuracy value: 60.31%
Testing Model Accuracy value: 63.39%

Decision on Complexity of Model: Simple Linear Model or Quadratic Model?

In [841]:
# Is OLS a good model ? Should we be building a simple linear model ? Check the residuals for each predictor.

fig = plt.figure(figsize=(10,5))
sns.residplot(x= X_test_scaled['cement'], y= y_test_scaled['strength'], color='green', lowess=True )


fig = plt.figure(figsize=(10,5))
sns.residplot(x= X_test_scaled['superplastic'], y= y_test_scaled['strength'], color='red', lowess=True );


fig = plt.figure(figsize=(10,5))
sns.residplot(x= X_test_scaled['age'], y= y_test_scaled['strength'], color='blue', lowess=True );


fig = plt.figure(figsize=(10,5))
sns.residplot(x= X_test_scaled['slag'], y= y_test_scaled['strength'], color='violet', lowess=True );


fig = plt.figure(figsize=(10,5))
sns.residplot(x= X_test_scaled['ash'], y= y_test_scaled['strength'], color='purple', lowess=True );


fig = plt.figure(figsize=(10,5))
sns.residplot(x= X_test_scaled['coarseagg'], y= y_test_scaled['strength'], color='orange', lowess=True );


fig = plt.figure(figsize=(10,5))
sns.residplot(x= X_test_scaled['fineagg'], y= y_test_scaled['strength'], color='brown', lowess=True );

Observations on model complexity:

  • The term 'Linear' can be interpreted in two ways

    Linearity in the variables - the independent variables x are raised to the power 1 Linearity in the parameters - The coefficient are raised to power 1, and x can be raised to any power.

  • So, here linear model means, linear in terms of parameters, the power of coefficient should be raised to power one but x can be of any power.

  • Complexity of the model also depends on the dimension we have included to build it. As the number of dimension increases, the complexity of the model also increases.

  • From the above residual plot the single pick and valley represents the model is quite linear in nature and simple. With more than one picks and valleys the model can be treated as quadratic model.

  • The Stochastic Disturbance or Stochastic Error term here is caused due to independent variables which are not taken into considerations, and this can be the error of disturbances which affects the target variable.

  • So from above graphical representation and fundamentals of Linear regression, it can be concluded that , the model will be simple linear model.

  • Further, we will check this in Ridge and Lasso models below and can reach out to conclusion.

Observations:

  • So the model explains 64% of the variability in Y using X

  • R^2 is not a reliable metric as it always increases with addition of more attributes even if the attributes have no influence on the predicted variable. Instead we can use adjusted R^2 which removes the statistical chance that improves R^2.

  • Scikit does not provide a facility for adjusted R^2... so we use will be using statsmodel, a library that gives results similar to what you obtain in R language.

  • This library expects the X and Y to be given in one single data frame

In [842]:
train_data = pd.concat([X_train_scaled, y_train_scaled], axis=1)
train_data.head()
Out[842]:
cement slag ash water superplastic coarseagg fineagg age strength
185 0.658961 -0.856888 -0.847144 1.004164 -1.039143 0.013910 0.017714 -0.501465 -0.795799
286 0.888723 1.337055 -0.847144 -0.537264 0.652383 -0.602435 -0.210645 0.718062 1.741687
600 -0.039901 -0.856888 -0.847144 0.441941 -1.039143 -0.063294 1.028482 -0.675683 -1.464756
691 0.946164 0.244722 -0.847144 2.175461 -1.039143 -0.526517 -2.240917 -0.612331 -0.179544
474 0.716401 -0.856888 1.372788 0.535645 0.803113 -2.212138 0.055149 -0.279733 0.302560
In [843]:
import statsmodels.formula.api as smf
lm1 = smf.ols(formula = 'strength ~ cement+slag+ash+water+superplastic+coarseagg+fineagg+age', data = train_data).fit()
lm1.params
Out[843]:
Intercept      -0.013041
cement          0.791708
slag            0.562061
ash             0.347477
water          -0.131861
superplastic    0.134499
coarseagg       0.111837
fineagg         0.173701
age             0.417591
dtype: float64
In [844]:
print(lm1.summary()) # Inferential Statistics
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               strength   R-squared:                       0.603
Model:                            OLS   Adj. R-squared:                  0.599
Method:                 Least Squares   F-statistic:                     135.3
Date:                Sun, 06 Sep 2020   Prob (F-statistic):          2.18e-137
Time:                        20:57:34   Log-Likelihood:                -683.03
No. Observations:                 721   AIC:                             1384.
Df Residuals:                     712   BIC:                             1425.
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -0.0130      0.023     -0.557      0.578      -0.059       0.033
cement           0.7917      0.064     12.444      0.000       0.667       0.917
slag             0.5621      0.063      8.912      0.000       0.438       0.686
ash              0.3475      0.058      5.940      0.000       0.233       0.462
water           -0.1319      0.059     -2.228      0.026      -0.248      -0.016
superplastic     0.1345      0.039      3.445      0.001       0.058       0.211
coarseagg        0.1118      0.052      2.169      0.030       0.011       0.213
fineagg          0.1737      0.060      2.877      0.004       0.055       0.292
age              0.4176      0.025     16.954      0.000       0.369       0.466
==============================================================================
Omnibus:                        2.930   Durbin-Watson:                   2.013
Prob(Omnibus):                  0.231   Jarque-Bera (JB):                2.793
Skew:                          -0.148   Prob(JB):                        0.248
Kurtosis:                       3.069   Cond. No.                         8.54
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

  • Here, adjusted R^2 is also coming very less, in fact less than expected R^2 value. So we'll be using regularised Linear Regression models like Ridge and Lasso to increase the model score...
In [845]:
# Let us check the sum of squared errors by predicting value of y for training cases and 
# subtracting from the actual y for the training cases

mse = np.mean((lr_model.predict(X_test_scaled)-y_test_scaled)**2)
In [846]:
# underroot of mean_sq_error is standard deviation i.e. avg variance between predicted and actual

math.sqrt(mse)
Out[846]:
0.618177114503482
  • So there is an average of 0.61 (roundoff) predicted strength differene from real or actual strength.
In [847]:
print('Training model Accuracy value: {0:0.2f}%'.format(lr_model_score_train*100))

print('Testing Model Accuracy value: {0:0.2f}%'.format(lr_model_score_test*100))


y_pred_scaled = lr_model.predict(X_test_scaled)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_scaled, y = pd.DataFrame(y_pred_scaled), kind = 'reg', color = 'red')
Training model Accuracy value: 60.31%
Testing Model Accuracy value: 63.39%

Observations:

  • The scatter plot shown above is drawn between actual target feature and predicted target feature.

  • Though, the variation of points are linear in nature, the scattered points are located beyond a virtual linear line indicates the model incapacity to predict the target feature perfectly.

  • Let us build regularised models to predict the target feature

Regularized RIDGE Model:

In [848]:
ridge = Ridge(alpha=.3)      # Lambda or aplha or hyperparameter value = 0.3
ridge.fit(X_train_scaled,y_train_scaled)
print ("Ridge model:", (ridge.coef_))
Ridge model: [[ 0.78652136  0.55696657  0.34291358 -0.13534196  0.13450497  0.10831507
   0.16918324  0.41713594]]
In [849]:
print('Training model Accuracy value: {0:0.2f}%'.format((ridge.score(X_train_scaled,y_train_scaled))*100))

print('Testing Model Accuracy value: {0:0.2f}%'.format((ridge.score(X_test_scaled, y_test_scaled))*100))

y_pred_scaled_ridge = ridge.predict(X_test_scaled)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_scaled, y = pd.DataFrame(y_pred_scaled_ridge), kind = 'reg', color = 'green')
Training model Accuracy value: 60.31%
Testing Model Accuracy value: 63.40%
In [850]:
# Let us check the sum of squared errors by predicting value of y for training cases and 
# subtracting from the actual y for the training cases

mse2 = np.mean((ridge.predict(X_test_scaled)-y_test_scaled)**2)
In [851]:
# underroot of mean_sq_error is standard deviation i.e. avg variance between predicted and actual

math.sqrt(mse2)
Out[851]:
0.6180760557197372

Regularized LASSO Model:

In [852]:
lasso = Lasso(alpha=0.1)      # Lambda or aplha or hyperparameter value = 0.1
lasso.fit(X_train_scaled,y_train_scaled)
print ("Lasso model:", (lasso.coef_))
Lasso model: [ 0.39456594  0.14777481  0.         -0.11823668  0.19361099 -0.
 -0.          0.2543608 ]
  • Observe, many of the coefficients have become 0 indicating drop of those dimensions from the model
In [853]:
print('Training model Accuracy value: {0:0.2f}%'.format((lasso.score(X_train_scaled,y_train_scaled))*100))

print('Testing Model Accuracy value: {0:0.2f}%'.format((lasso.score(X_test_scaled, y_test_scaled))*100))

y_pred_scaled_lasso = lasso.predict(X_test_scaled)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_scaled, y = pd.DataFrame(y_pred_scaled_lasso), kind = 'reg', color = 'teal')
Training model Accuracy value: 52.16%
Testing Model Accuracy value: 51.86%

Observations on model complexity:

  • Continuing the above discussion, we have built two regularised linear models.

  • Both the models have tried to minimize the SSE (Sum of Square error) by finding out the least combination of variables.

  • The ridge model has tried to push the coefficients towards zero by penalizing large magnitude coefficients while keeping the cost function same as linear un-regularised model.

  • Thus by doing so, coefficient of some of the variables are dragged towards zero and made the model simple and prevented it from overfitting.

  • Similarly, for Lasso model unlike Ridge, the penalty term makes the coefficients to zero which has high magnitude. Thus, by dropping the variable itself.

  • This penalty term ensures the corresponding variable is totally dropped from the model building and makes the model much simpler and prevents it from overfitting. However the cost function is retained unaffected and almost equal to the cost function of non-regularised linear model.

Thus from the above discussion, it can be concluded that, for this dataset we will be building very simple linear models and avoid any higher degree or quadratic in terms of parameters.

Observations:

  • The scatter plot shown above is drawn between actual target feature and predicted target feature for both Regularised Ridge and Lasso models.

  • Similarly, though, the variation of points are linear in nature, the scatterings of the points beyond a virtual linear line indicates the model shapelessness to predict the target feature perfectly.

  • Here, Regularised models also failed to improve the score and prediction capability like un-regularised model. The scores are also nearly same to un-regularised model.

  • Let us try to improve the scores by generating polynomial features for linear models which will reflect the non-linear interaction between some dimensions while considering the significant correlation among them.

Let us generate polynomial models reflecting the non-linear interaction between some dimensions

In [854]:
poly = PolynomialFeatures(degree = 2, interaction_only = True)

# degree = 2: means we have allowed to creat polynomials upto the power 2 of the existing columns.
# interaction_only = True: those dimensions are taken which shows some correlation among them
In [855]:
X_poly= poly.fit_transform(X_scaled)
X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(X_poly, y_scaled, test_size=0.30, random_state= 10248)
X_train_p.shape
Out[855]:
(721, 37)
  • Thus, after imputing generation of polynomial features, we got some 37 number of total dimensions. Initially it was 9 dimensions including target column. Let's explore the score of non-regularised and regularised models by using these polynomial features.

Fiting a simple non regularized linear model on poly features

  • Fit the polynomial features to the models and check for the coefficients.
In [856]:
lr_model.fit(X_train_p, y_train_p)
print(lr_model.coef_[0])
[ 6.39972415e-17  7.34602554e-01  5.68042742e-01  3.39048716e-01
 -1.74984578e-01  2.16214883e-01 -1.12973623e-02  2.98548582e-02
  8.93807633e-01  9.92122871e-02  1.45588093e-01 -2.23719119e-01
 -2.21089909e-01  2.38849334e-02  3.07479840e-02  1.39685296e-01
  1.87415624e-01 -1.37371427e-01 -1.37124684e-01 -1.08772156e-02
  1.40871194e-01  2.39528051e-01 -1.54164381e-01 -2.29300129e-01
  5.89403027e-02  1.67826449e-01  3.41614739e-01 -9.69809217e-03
 -1.12419999e-01 -2.57775313e-02 -4.54021542e-02 -5.29730045e-02
 -6.25986274e-02  2.02895813e-01  8.05425167e-02  3.72501669e-02
  1.31563070e-01]
In [857]:
# Let us check the sum of squared errors by predicting value of y for training cases and 
# subtracting from the actual y for the training cases

mse11 = np.mean((lr_model.predict(X_test_p)-y_test_p)**2)
In [858]:
# underroot of mean_sq_error is standard deviation i.e. avg variance between predicted and actual

math.sqrt(mse11)
Out[858]:
0.5061410103618659
In [859]:
ridge = Ridge(alpha =0.3)
ridge.fit(X_train_p, y_train_p)
print('Ridge model:', (ridge.coef_))
Ridge model: [[ 0.          0.72689315  0.55965677  0.33196901 -0.18016786  0.21667903
  -0.01522698  0.02323714  0.89202017  0.09776932  0.14448824 -0.22021188
  -0.21449324  0.02436072  0.03061528  0.13144826  0.18453406 -0.13482714
  -0.13122145 -0.01147661  0.1404346   0.23271311 -0.15148918 -0.22484575
   0.05810602  0.16703418  0.33292017 -0.00644743 -0.1108817  -0.02539434
  -0.05059566 -0.04914613 -0.05820805  0.20402028  0.08058942  0.03544339
   0.12496179]]
In [860]:
# Let us check the sum of squared errors by predicting value of y for training cases and 
# subtracting from the actual y for the training cases

mse22 = np.mean((ridge.predict(X_test_p)-y_test_p)**2)
In [861]:
# underroot of mean_sq_error is standard deviation i.e. avg variance between predicted and actual

math.sqrt(mse22)
Out[861]:
0.5061958130880813
In [862]:
lasso = Lasso(alpha =0.2)
lasso.fit(X_train_p, y_train_p)
print('Lasso model:', (lasso.coef_))
Lasso model: [ 0.          0.29911105  0.01657977 -0.         -0.01049749  0.17274419
 -0.         -0.          0.15160356  0.         -0.         -0.
  0.         -0.         -0.         -0.         -0.         -0.
  0.         -0.          0.          0.          0.         -0.
  0.          0.          0.         -0.         -0.         -0.
 -0.         -0.          0.         -0.         -0.          0.
  0.        ]
In [863]:
# Let us check the sum of squared errors by predicting value of y for training cases and 
# subtracting from the actual y for the training cases

mse33 = np.mean(((lasso.predict(X_test_p)).reshape(309,1) -y_test_p)**2)
In [864]:
# underroot of mean_sq_error is standard deviation i.e. avg variance between predicted and actual

math.sqrt(mse33)
Out[864]:
0.7293533770527977
In [865]:
print('Training model Accuracy value for Linear Model: {0:0.2f}%'.format((lr_model.score(X_train_p, y_train_p))*100))

print('Testing Model Accuracy value for Linear Model: {0:0.2f}%'.format((lr_model.score(X_test_p, y_test_p))*100))


y_pred_scaled_lr = lr_model.predict(X_test_p)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_p, y = pd.DataFrame(y_pred_scaled_lr), kind = 'reg', color = 'green')
Training model Accuracy value for Linear Model: 76.74%
Testing Model Accuracy value for Linear Model: 69.97%
In [866]:
print('Training model Accuracy value for Ridge Model: {0:0.2f}%'.format((ridge.score(X_train_p, y_train_p))*100))

print('Testing Model Accuracy value for Ridge Model: {0:0.2f}%'.format((ridge.score(X_test_p, y_test_p))*100))


y_pred_scaled_ridge_p = ridge.predict(X_test_p)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_p, y = pd.DataFrame(y_pred_scaled_ridge_p), kind = 'reg', color = 'r')
Training model Accuracy value for Ridge Model: 76.74%
Testing Model Accuracy value for Ridge Model: 69.96%
In [867]:
print('Training model Accuracy value for Lasso Model: {0:0.2f}%'.format((lasso.score(X_train_p, y_train_p))*100))

print('Testing Model Accuracy value for Lasso Model: {0:0.2f}%'.format((lasso.score(X_test_p, y_test_p))*100))


y_pred_scaled_lasso_p = lasso.predict(X_test_p)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_p, y = pd.DataFrame(y_pred_scaled_lasso_p), kind = 'reg', color = 'violet')
Training model Accuracy value for Lasso Model: 38.58%
Testing Model Accuracy value for Lasso Model: 37.63%

Observations:

  • More or less similar results for linear regression and ridge models but lasso has a very low score with only four polynomial features and less complex model , as complexity is a function of variables and coefficients.

  • But in case of Ridge and un-regularised model the dimensions are almost same.

Feature Importance for the individual Features

In [868]:
#ways of dropping variables:
#significance of variables (p-values)
#VIF--variance inflation factor

#Computing VIF
#VIF=1/1-r^2
#create a dataframe which will contain all the features and their respective VIF values
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif=pd.DataFrame()
vif['Features'] = X_train_scaled.columns
vif['VIF'] = [variance_inflation_factor(X_train_scaled.values,i) for i in range(X_train_scaled.shape[1]) ]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
vif
Out[868]:
Features VIF
0 cement 7.49
1 slag 7.28
6 fineagg 6.65
3 water 6.41
2 ash 6.29
5 coarseagg 4.76
4 superplastic 2.71
7 age 1.14

Observations:

  • None of the column are having very high VIF value, so we may not be required to drop any of them. However some composite columns can be formed to check the model performance.

  • Based upon the VIF score of ash & slag, their correlation with target column and the importance in strengthening the concrete we can merge these two columns, and create a composite column.

Decision Tree Regressor:

In [869]:
dtr_model = DecisionTreeRegressor(random_state=0, max_depth=3)

dtr_model.fit(X_train_scaled, y_train_scaled)
y_pred = dtr_model.predict(X_test_scaled)
In [870]:
print('Training model Accuracy value for Decision Tree Regressor Model: {0:0.2f}%'.format((dtr_model.score(X_train_scaled, y_train_scaled))*100))

print('Testing Model Accuracy value for Decision Tree Regressor Model: {0:0.2f}%'.format((dtr_model.score(X_test_scaled, y_test_scaled))*100))


sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('dark'):
    sns.jointplot(x = y_test_scaled, y=pd.DataFrame(y_pred), kind = 'reg', color = 'purple')
Training model Accuracy value for Decision Tree Regressor Model: 63.58%
Testing Model Accuracy value for Decision Tree Regressor Model: 58.31%
In [871]:
# Let us check the sum of squared errors by predicting value of y for training cases and 
# subtracting from the actual y for the training cases

mse4 = np.mean((y_pred.reshape(309,1) - y_test_scaled)**2)
In [872]:
# underroot of mean_sq_error is standard deviation i.e. avg variance between predicted and actual

math.sqrt(mse4)
Out[872]:
0.6596825058691177
In [873]:
feature_importances = dtr_model.feature_importances_

feature_names = df.columns[0:8]
print(feature_names)

k=8
print(feature_importances)
top_k_idx = (feature_importances.argsort()[-k:][::-1])
print(feature_names[top_k_idx], feature_importances)
Index(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
       'fineagg', 'age'],
      dtype='object')
[0.46405785 0.         0.         0.08815856 0.         0.
 0.         0.44778359]
Index(['cement', 'age', 'water', 'fineagg', 'coarseagg', 'superplastic', 'ash',
       'slag'],
      dtype='object') [0.46405785 0.         0.         0.08815856 0.         0.
 0.         0.44778359]

Observations on Feature Importance:

  • Among all features, columns like cement, water and age are highly important features in case of decision tree regression model.

  • These columns play important role determining the model performance.

  • Here we got a model performance score of 58% which is very low. We will explore further to enhance the model performance

Support Vector Regressor:

In [874]:
from sklearn import svm
clr = svm.SVR()
clr.fit(X_train_scaled, y_train_scaled)
Out[874]:
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
In [875]:
print('Training model Accuracy value for Support Vector Regressor Model: {0:0.2f}%'.format((clr.score(X_train_scaled, y_train_scaled))*100))

print('Testing Model Accuracy value for Support Vector Regressor Model: {0:0.2f}%'.format((clr.score(X_test_scaled, y_test_scaled))*100))



y_pred_clr = clr.predict(X_test_scaled)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_scaled, y = pd.DataFrame(y_pred_clr), kind = 'reg', color = 'green')
Training model Accuracy value for Support Vector Regressor Model: 87.80%
Testing Model Accuracy value for Support Vector Regressor Model: 82.94%
In [876]:
# Let us check the sum of squared errors by predicting value of y for training cases and 
# subtracting from the actual y for the training cases

mse5 = np.mean(((clr.predict(X_test_scaled)).reshape(309,1) -y_test_scaled)**2)
In [877]:
# underroot of mean_sq_error is standard deviation i.e. avg variance between predicted and actual

math.sqrt(mse5)
Out[877]:
0.42194717792876646

Observations:

  • Here, we got a very good score of 82%, which is the best among all the models.

  • Let's explore a bit more to enhance the model performance.

Random Forest Regressor:

In [878]:
rfr = RandomForestRegressor(n_estimators = 50, random_state = 559, max_features = 8 )
rfr = rfr.fit(X_train_scaled, y_train_scaled)
In [879]:
print('Training model Accuracy value for Random Forest Regressor Model: {0:0.2f}%'.format((rfr.score(X_train_scaled, y_train_scaled))*100))

print('Testing Model Accuracy value for Random Forest Regressor Model: {0:0.2f}%'.format((rfr.score(X_test_scaled, y_test_scaled))*100))


y_predict_rfr = rfr.predict(X_test_scaled)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_scaled, y = pd.DataFrame(y_predict_rfr), kind = 'reg', color = 'brown')
Training model Accuracy value for Random Forest Regressor Model: 98.30%
Testing Model Accuracy value for Random Forest Regressor Model: 90.36%
In [880]:
# Let us check the sum of squared errors by predicting value of y for training cases and 
# subtracting from the actual y for the training cases

mse6 = np.mean(((rfr.predict(X_test_scaled)).reshape(309,1) -y_test_scaled)**2)
In [881]:
# underroot of mean_sq_error is standard deviation i.e. avg variance between predicted and actual

math.sqrt(mse6)
Out[881]:
0.31726725280001633

Observations:

  • Here, we got a very good score of 90.36%, which is the best among all the models.

  • Let's explore a bit more to enhance the model performance.

XGBoost Regressor:

In [882]:
xgbr = xgboost.XGBRegressor()
xgbr.fit(X_train_scaled, y_train_scaled)
Out[882]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1,
             objective='reg:squarederror', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)
In [883]:
# Predicting the Test set results

print('Training model Accuracy value for XGBoost Regressor Model: {0:0.2f}%'.format((xgbr.score(X_train_scaled, y_train_scaled))*100))

print('Testing Model Accuracy value for XGBoost Regressor Model: {0:0.2f}%'.format((xgbr.score(X_test_scaled, y_test_scaled))*100))


y_pred_xgbr = xgbr.predict(X_test_scaled)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_scaled, y = pd.DataFrame(y_pred_xgbr), kind = 'reg', color = 'orange')
Training model Accuracy value for XGBoost Regressor Model: 99.43%
Testing Model Accuracy value for XGBoost Regressor Model: 92.20%
In [884]:
# Let us check the sum of squared errors by predicting value of y for training cases and 
# subtracting from the actual y for the training cases

mse7 = np.mean(((xgbr.predict(X_test_scaled)).reshape(309,1) -y_test_scaled)**2)
In [885]:
# underroot of mean_sq_error is standard deviation i.e. avg variance between predicted and actual

math.sqrt(mse7)
Out[885]:
0.2854185223578459

Observations:

  • Here, we got a very good score of 92.20%, which is the best among all the models.

  • Let's explore a bit more to enhance the model performance.

Gradient Boost Regressor:

In [886]:
gbr =  GradientBoostingRegressor(n_estimators = 50, random_state = 559, max_features = 8 )
gbr.fit(X_train_scaled, y_train_scaled)
Out[886]:
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=8, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=50,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=559, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)
In [887]:
# Predicting the Test set results

print('Training model Accuracy value for Gradient Boost Regressor Model: {0:0.2f}%'.format((gbr.score(X_train_scaled, y_train_scaled))*100))

print('Testing Model Accuracy value for Gradient Boost Regressor Model: {0:0.2f}%'.format((gbr.score(X_test_scaled, y_test_scaled))*100))

y_pred_gbr = gbr.predict(X_test_scaled)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_scaled, y = pd.DataFrame(y_pred_gbr), kind = 'reg', color = 'blue')
Training model Accuracy value for Gradient Boost Regressor Model: 91.53%
Testing Model Accuracy value for Gradient Boost Regressor Model: 86.62%
In [888]:
# Let us check the sum of squared errors by predicting value of y for training cases and 
# subtracting from the actual y for the training cases

mse8 = np.mean(((gbr.predict(X_test_scaled)).reshape(309,1) -y_test_scaled)**2)
In [889]:
# underroot of mean_sq_error is standard deviation i.e. avg variance between predicted and actual

math.sqrt(mse8)
Out[889]:
0.3737391244999938

Observations:

  • Here, we got a very good score of 86.61%, which is the best among all the models.

  • Let's explore a bit more to enhance the model performance.

In [890]:
req_col_names = ['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
       'fineagg', 'age', 'strength']

feature_dtr = dtr_model.feature_importances_
feature_rfr = rfr.feature_importances_

labels = req_col_names[:-1]

x = np.arange(len(labels)) 
width = 0.3

fig, ax = plt.subplots(figsize=(10,6))
rects1 = ax.bar(x-(width/2), feature_dtr, width, label='Decision Tree')
rects2 = ax.bar(x+(width/2), feature_rfr, width, label='Random Forest')

ax.set_ylabel('Importance')
ax.set_xlabel('Features')
ax.set_title('Feature Importance')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=45)
ax.legend(loc="upper left", bbox_to_anchor=(1,1))

def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{:.2f}'.format(height), xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3), textcoords="offset points", ha='center', va='bottom')
autolabel(rects1)
autolabel(rects2)

fig.tight_layout()
plt.show()

Observations for feature importance:

  • The columns like Cement and age are highly important columns based on their importance values from tree based models like Decision Tree and Random Forest.

  • Other columns like water is also important based upon the importance values given by two tree based models.

  • So, these three features are important features in terms of calculation of model score and minimisation of cost function.

  • Rest of the columns are also required to calculate the model score.

  • However, As per Decision tree, only cement, water and age are important columns and rest of the columns can be dropped out from calculation. This is what makes the model simpler and prevents it from overfitting.

  • Random forest model has given importance to each of the columns, but it is almost negligible

In [892]:
req_col_names = ['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
       'fineagg', 'age', 'strength']

feature_xgbr = xgbr.feature_importances_
feature_gbr = gbr.feature_importances_

labels = req_col_names[:-1]

x = np.arange(len(labels)) 
width = 0.3

fig, ax = plt.subplots(figsize=(10,6))
rects1 = ax.bar(x-(width/2), feature_dtr, width, label='XGBoost Regressor')
rects2 = ax.bar(x+(width/2), feature_rfr, width, label='Gradient Boost Regressor')

ax.set_ylabel('Importance')
ax.set_xlabel('Features')
ax.set_title('Feature Importance')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=45)
ax.legend(loc="upper left", bbox_to_anchor=(1,1))

def autolabel(rects):
    """Attach a text label above each bar in *rects*, displaying its height."""
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{:.2f}'.format(height), xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 3), textcoords="offset points", ha='center', va='bottom')
autolabel(rects1)
autolabel(rects2)

fig.tight_layout()
plt.show()

Observations for feature importance:

  • The discussion for feature importance is also similar for both XGBoost Regressor and Gradient Boost Regressor.

5.b. Model building using KFold CV

In [893]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

num_folds = 10
seed = 235

kfold = KFold(n_splits = num_folds, random_state = seed)
result_lr = cross_val_score(lr_model, X_scaled, y_scaled, cv = kfold)
print(result_lr)
print('\n')
print('Model Score in itteration 1 for Linear Regression:%.3f%% (%.3f%%)'%(result_lr.mean()*100.0, result_lr.std()*100.0))

print('\n')
result_ridge = cross_val_score(ridge, X_scaled, y_scaled, cv = kfold)
print(result_ridge)
print('\n')
print('Model Score  in itteration 1 for Ridge:%.3f%% (%.3f%%)'%(result_ridge.mean()*100.0, result_ridge.std()*100.0))

print('\n')
result_lasso = cross_val_score(lasso, X_scaled, y_scaled, cv = kfold)
print(result_lasso)
print('\n')
print('Model Score  in itteration 1 for Lasso:%.3f%% (%.3f%%)'%(result_lasso.mean()*100.0, result_lasso.std()*100.0))

print('\n')
result_DTR = cross_val_score(dtr_model, X_scaled, y_scaled, cv = kfold)
print(result_DTR)
print('\n')
print('Model Score  in itteration 1 for Desission Tree Regressor:%.3f%% (%.3f%%)'%(result_DTR.mean()*100.0, result_DTR.std()*100.0))

print('\n')
result_clr = cross_val_score(clr, X_scaled, y_scaled, cv = kfold)
print(result_clr)
print('\n')
print('Model Score  in itteration 1 for Support Vector Regressor:%.3f%% (%.3f%%)'%(result_clr.mean()*100.0, result_clr.std()*100.0))

print('\n')
result_rfr = cross_val_score(rfr, X_scaled, y_scaled, cv = kfold)
print(result_rfr)
print('\n')
print('Model Score  in itteration 1 for Random Forest Regressor:%.3f%% (%.3f%%)'%(result_rfr.mean()*100.0, result_rfr.std()*100.0))

print('\n')
result_xgbr = cross_val_score(xgbr, X_scaled, y_scaled, cv = kfold)
print(result_xgbr)
print('\n')
print('Model Score  in itteration 1 for XGBoost Regressor:%.3f%% (%.3f%%)'%(result_xgbr.mean()*100.0, result_xgbr.std()*100.0))

print('\n')
result_gbr = cross_val_score(gbr, X_scaled, y_scaled, cv = kfold)
print(result_gbr)
print('\n')
print('Model Score  in itteration 1 for Gradient Boosting Regressor:%.3f%% (%.3f%%)'%(result_gbr.mean()*100.0, result_gbr.std()*100.0))
[0.47614371 0.67567905 0.69840899 0.57072063 0.54738709 0.68539964
 0.59863602 0.61651603 0.50396964 0.54624389]


Model Score in itteration 1 for Linear Regression:59.191% (7.308%)


[0.47645386 0.67559931 0.69851367 0.57084948 0.54756787 0.68514082
 0.59855102 0.61645325 0.50379018 0.54631849]


Model Score  in itteration 1 for Ridge:59.192% (7.300%)


[0.41551766 0.32938775 0.43235795 0.34052628 0.2892517  0.37209478
 0.22944998 0.36561236 0.29240362 0.31677507]


Model Score  in itteration 1 for Lasso:33.834% (5.797%)


[0.57487956 0.51803811 0.62886549 0.51395936 0.65347435 0.69471346
 0.66286365 0.66575227 0.50971086 0.58253734]


Model Score  in itteration 1 for Desission Tree Regressor:60.048% (6.644%)


[0.83003812 0.83733469 0.88099832 0.82125079 0.7863025  0.86361926
 0.87568031 0.86103976 0.80961132 0.82391015]


Model Score  in itteration 1 for Support Vector Regressor:83.898% (2.910%)


[0.8757276  0.89030705 0.95701515 0.91619975 0.88700427 0.93466806
 0.93812086 0.92257731 0.91323525 0.92302729]


Model Score  in itteration 1 for Random Forest Regressor:91.579% (2.396%)


[0.89282039 0.89636884 0.97009276 0.94382776 0.91281088 0.95380756
 0.95459902 0.95072415 0.93223956 0.94836257]


Model Score  in itteration 1 for XGBoost Regressor:93.557% (2.498%)


[0.86407698 0.84922588 0.90978198 0.84909457 0.84471524 0.89733176
 0.89758493 0.88778781 0.86342283 0.8781648 ]


Model Score  in itteration 1 for Gradient Boosting Regressor:87.412% (2.211%)

5.c. Observations on Model Score

In [894]:
result_itteration1 = pd.DataFrame({'Algorithm' : ['Linear Regression', 'Ridge', 'Lasso','Deision Tree', 'Support Vector', 
                                      'Random Forest', 'XGBoost', 'GradientBoost'],
                      'Model_score': [lr_model.score(X_test_p, y_test_p)*100.0, ridge.score(X_test_p, y_test_p)*100.0,
                                         lasso.score(X_test_p, y_test_p)*100.0, dtr_model.score(X_test_scaled, y_test_scaled)*100.0,
                                         clr.score(X_test_scaled, y_test_scaled)*100.0, rfr.score(X_test_scaled, y_test_scaled)*100.0,
                                         xgbr.score(X_test_scaled, y_test_scaled)*100.0, gbr.score(X_test_scaled, y_test_scaled)*100.0],
                       
                       
                      'Root Mean Square Error' : [math.sqrt(mse11), math.sqrt(mse22), math.sqrt(mse33), 
                                           math.sqrt(mse4), math.sqrt(mse5), math.sqrt(mse6), 
                                           math.sqrt(mse7), math.sqrt(mse8)],
                                   
                      'Cross_val_score' : [result_lr.mean()*100.0, result_ridge.mean()*100.0, result_lasso.mean()*100.0, 
                                           result_DTR.mean()*100.0, result_clr.mean()*100.0, result_rfr.mean()*100.0, 
                                           result_xgbr.mean()*100.0, result_gbr.mean()*100.0],
                       
                      'Std_Dev':[result_lr.std()*100.0, result_ridge.std()*100.0, result_lasso.std()*100.0,
                                 result_DTR.std()*100.0, result_clr.std()*100.0, result_rfr.std()*100.0, 
                                 result_xgbr.std()*100.0, result_gbr.std()*100.0]})
result_itteration1
Out[894]:
Algorithm Model_score Root Mean Square Error Cross_val_score Std_Dev
0 Linear Regression 69.965042 0.506141 59.191047 7.307688
1 Ridge 69.958537 0.506196 59.192380 7.300019
2 Lasso 37.632265 0.729353 33.833772 5.797047
3 Deision Tree 58.310412 0.659683 60.047945 6.644179
4 Support Vector 82.944132 0.421947 83.897852 2.909932
5 Random Forest 90.357089 0.317267 91.578826 2.396392
6 XGBoost 92.195915 0.285419 93.556535 2.498270
7 GradientBoost 86.618808 0.373739 87.411868 2.211392

  • In Linear Regression Analysis, Cost function is the Root mean Square error, and we need to minimize it. The model which gives minimum RMSE value is the suitable model to go with. So here, XGBoost has the lowest RMSE value, and can be chosen for further analysis. However, model score also plays an important role to decide the accurate model.

  • Among all models XGBoost has maximum score with train_test_split method,

  • Similarly among all the models XGBoost has also maximum cross validation score with a standard deviation of 2.49.

  • Cross validation score of Linear regression, Ridge and Lasso models are less than normal score, this is because, in normal score we have considered the features with the help of polynomial Feature Elimination.

  • However, without adopting polynomial features these three models give a score almost similar to cross validation score.

  • We will try to explore more about the scores and enhance the model performance by addressing the outliers present in the features.

6. Iteration -2: Addressing Mix of Gaussian, Generation of Clusters, Addressing Outliers, Modeling Building

6.a. Determining Mix of Gaussian

  • The data seems to be a mix of Gaussians, thus exploring more from the pair panel visual inspection, and we can expect 4 to 5 clusters from this dataset.

  • Thus, we will be restricting our clusters from 3 to 8.

  • To build the clusters we will be using K-Means Clustering Technique from Unsupervised Learning

6.b. Generation of Clusters by Using K-Means Culstering

In [895]:
cluster_range = range (3,12)
cluster_errors = []
for num_clusters in cluster_range:
    clusters = KMeans (num_clusters, n_init = 5)
    clusters.fit(df)
    labels = clusters.labels_
    centroids = clusters.cluster_centers_
    cluster_errors.append(clusters.inertia_)
clusters_df = pd.DataFrame({'num_clusters':cluster_range, 'cluster_errors': cluster_errors})
clusters_df[0:30]
Out[895]:
num_clusters cluster_errors
0 3 2.499578e+07
1 4 2.198953e+07
2 5 1.953573e+07
3 6 1.783023e+07
4 7 1.640984e+07
5 8 1.466079e+07
6 9 1.347383e+07
7 10 1.248042e+07
8 11 1.148397e+07
  • The above pivot table shows us the cluster error for each cluster

  • Let's draw the Elbow Plot to see the number of clusters

In [896]:
# Elbow Plot

plt.figure(figsize = (12,6))
plt.plot(clusters_df.num_clusters, clusters_df.cluster_errors, marker = 'o');
  • The elbow plot confirms our visual analysis that there are likely 5 or 8 clusters.
  • Let us start with 5 clusters
In [897]:
cluster = KMeans(n_clusters = 8, random_state = 2535)
cluster.fit(df_scaled)

prediction = cluster.predict(df_scaled)
# Creating a new column 'GROUP' which will hold the cluster id of each record
df_scaled['GROUP'] = prediction

# Creating a mirror copy for later re-use instead of building repeatedly 
df_scaled_copy = df_scaled.copy(deep = True)
  • let's see the centroids of these clusters
In [898]:
centroids = cluster.cluster_centers_
centroids
Out[898]:
array([[-5.84357704e-01, -6.52474867e-01,  1.11079472e+00,
        -6.21917608e-01,  3.99539649e-01,  5.90408503e-01,
         4.06974789e-01, -1.22264538e-01, -1.79261899e-01],
       [ 6.36751835e-01,  9.31678239e-01, -6.09935220e-01,
        -5.60205247e-01,  6.84434711e-01, -1.89071674e-01,
        -4.33119682e-01, -1.76103365e-01,  1.14043980e+00],
       [ 4.94141497e-01, -3.83486490e-03, -8.47143932e-01,
         1.66335905e+00, -1.03914281e+00, -1.91547887e-01,
        -1.20565574e+00,  3.91034176e+00,  6.03633443e-01],
       [-8.96869437e-01,  1.36310863e+00, -8.37447314e-01,
         5.67936853e-01, -7.71920331e-01, -3.27458291e-02,
        -1.06243363e-01, -2.46753962e-01, -6.38902094e-01],
       [-4.15036915e-01,  1.87711454e-01,  1.10377802e+00,
         5.75748454e-01,  4.02791016e-01, -1.21882778e+00,
        -3.56859274e-01, -2.99116948e-01, -2.12049391e-01],
       [ 1.71106634e+00, -6.00162481e-01, -7.53600282e-01,
         7.18981498e-01, -8.74959664e-01,  6.24623475e-01,
        -1.55753862e+00, -2.26899745e-02,  6.78308054e-01],
       [ 1.30253532e+00,  8.33213980e-02, -4.64713785e-01,
        -1.45438998e+00,  1.71649497e+00, -1.10015906e+00,
         1.07855248e+00, -1.70054883e-01,  1.14270938e+00],
       [ 3.50516187e-01, -8.56887890e-01, -8.37900730e-01,
         4.49759227e-01, -1.01088094e+00,  5.27767285e-01,
         3.89233375e-01, -1.22936799e-01, -6.99030244e-01]])
  • Instead of interpreting the numerical values of the centroids, let us do a visual analysis by converting the centroids and the data in the cluster into box plots.

6.c. Addressing Outliers

In [899]:
df_scaled.boxplot(by = 'GROUP', layout = (3,4), figsize = (15,20), color = 'orange');

  • Many outliers can be seen on each dimension (indicated by the black circle)
  • Spread of data on each dimension (indicated by the whiskers is long....due to the outliers)
  • If the outliers are addressed, the clusters will overlap much
In [900]:
# Addressing the outliers at group level

def replace(group):
    median, std = group.median(), group.std()  # Get the median and the standard deviation of every group
    outliers = (group - median).abs()> 2*std  # Subtract median from every member of each group. Take absolute values > 2 std. dev
    group[outliers] = group.median()
    return group

df_corrected = (df_scaled.groupby('GROUP').transform(replace))
concat_data = df_corrected.join(pd.DataFrame(df_scaled['GROUP']))
In [901]:
concat_data.boxplot(by = 'GROUP', layout = (3,4), figsize = (15,20), color = 'blue');

  • NOTE: When we remove outliers and replace with median or mean, the distribution shape changes, the standard deviation becomes tighter creating new outliers. The new outliers would be much closer to the centre than original outliers so we accept them without modifying them.

  • By replacing outliers, forcefully we created some new outliers by making the boundary much tighter.

Let us analyze the target column vs other columns group wise

In [902]:
with sns.axes_style('white'):
    plot = sns.lmplot('cement', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Cement vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('ash', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Ash vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('water', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Water vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('slag', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Slag vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('age', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Age vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('fineagg', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Fine Aggregate vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('coarseagg', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Coarse Aggregate vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('superplastic', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Super Plastic vs. Strength', color = 'brown');

Observations:

  • The above plots show the relation of target column with independent columns in cluster wise or group wise format.

  • For each independent variables, group of clusters were plotted with their respective best fit lines.

  • The more horizontal a line is, the more week it is in predicting the target column. It means, a horizontal line will only take one value of target column and fix it. So the spread of all data points can't be captured by this line.

  • We can observe from above plots that, in each of the cluster, data are far widen or spread away from the best fit line signifying too large variance in the cluster.

In [903]:
df_corrected.head()
Out[903]:
cement slag ash water superplastic coarseagg fineagg age strength
0 -1.339017 1.601441 -0.847144 1.027590 -1.039143 -0.014398 -0.312970 -0.279733 -0.355018
1 -1.074790 -0.367541 1.096078 -1.090116 0.769617 1.388141 0.282260 -0.501465 -0.737108
2 -0.298384 -0.856888 0.648965 0.273274 -0.118015 -0.206121 1.093371 -0.279733 -0.395144
3 -0.145209 0.465044 -0.847144 0.488793 -1.039143 -0.526517 -1.292542 -0.279733 0.600806
4 -1.209776 1.269798 -0.847144 0.549700 -1.039143 0.958372 -0.959363 -0.279733 -1.049727
In [904]:
X_scaled_corre = df_corrected.drop('strength', axis =1)
y_scaled_corre = df_corrected[['strength']]
In [905]:
# Split X and y into training and test set in 70:30 ratio

X_train_corre, X_test_corre, y_train_corre, y_test_corre = train_test_split(X_scaled_corre, y_scaled_corre, test_size=0.30, random_state=1)

6.d. Various Linear model building and testing with Train_Test_Split method

In [906]:
lr_model.fit(X_train_corre, y_train_corre)
ridge.fit(X_train_corre,y_train_corre)
lasso.fit(X_train_corre,y_train_corre)
dtr_model.fit(X_train_corre,y_train_corre)
clr.fit(X_train_corre,y_train_corre)
rfr = rfr.fit(X_train_corre,y_train_corre)
xgbr.fit(X_train_corre,y_train_corre)
gbr.fit(X_train_corre,y_train_corre)
Out[906]:
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=8, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=50,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=559, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)
In [907]:
# Predicting the Test set results

print('Training model Accuracy value for Linear Regressor Model: {0:0.2f}%'.format((lr_model.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for Linear Regressor Model: {0:0.2f}%'.format((lr_model.score(X_test_corre, y_test_corre))*100))

y_pred_lr_model = lr_model.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_lr_model), kind = 'reg', color = 'blue')
Training model Accuracy value for Linear Regressor Model: 56.50%
Testing Model Accuracy value for Linear Regressor Model: 60.35%
In [908]:
# Predicting the Test set results

print('Training model Accuracy value for Ridge Regressor Model: {0:0.2f}%'.format((ridge.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for Ridge Regressor Model: {0:0.2f}%'.format((ridge.score(X_test_corre, y_test_corre))*100))

y_pred_ridge = ridge.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_ridge), kind = 'reg', color = 'green')
Training model Accuracy value for Ridge Regressor Model: 56.50%
Testing Model Accuracy value for Ridge Regressor Model: 60.34%
In [909]:
# Predicting the Test set results

print('Training model Accuracy value for Lasso Regressor Model: {0:0.2f}%'.format((lasso.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for Lasso Regressor Model: {0:0.2f}%'.format((lasso.score(X_test_corre, y_test_corre))*100))

y_pred_lasso = lasso.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_lasso), kind = 'reg', color = 'red')
Training model Accuracy value for Lasso Regressor Model: 31.64%
Testing Model Accuracy value for Lasso Regressor Model: 30.02%
In [910]:
# Predicting the Test set results

print('Training model Accuracy value for Decision Tree Regressor Model: {0:0.2f}%'.format((dtr_model.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for Decision Tree Regressor Model: {0:0.2f}%'.format((dtr_model.score(X_test_corre, y_test_corre))*100))

y_pred_dtr_model = dtr_model.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_dtr_model), kind = 'reg', color = 'violet')
Training model Accuracy value for Decision Tree Regressor Model: 61.41%
Testing Model Accuracy value for Decision Tree Regressor Model: 52.92%
In [911]:
# Predicting the Test set results

print('Training model Accuracy value for Support vector Regressor Model: {0:0.2f}%'.format((clr.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for Support vector Regressor Model: {0:0.2f}%'.format((clr.score(X_test_corre, y_test_corre))*100))

y_pred_clr = clr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_clr), kind = 'reg', color = 'purple')
Training model Accuracy value for Support vector Regressor Model: 79.47%
Testing Model Accuracy value for Support vector Regressor Model: 75.98%
In [912]:
# Predicting the Test set results

print('Training model Accuracy value for Random Forest Regressor Model: {0:0.2f}%'.format((rfr.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for Random Forest Regressor Model: {0:0.2f}%'.format((rfr.score(X_test_corre, y_test_corre))*100))

y_pred_rfr = rfr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_rfr), kind = 'reg', color = 'orange')
Training model Accuracy value for Random Forest Regressor Model: 96.18%
Testing Model Accuracy value for Random Forest Regressor Model: 82.79%
  • For Random forest, the training score is coming around 96%, however, with test set, it gives a score of around 83%, which is a sign of overfitting of model.
In [913]:
# Predicting the Test set results

print('Training model Accuracy value for XGBoost Regressor Model: {0:0.2f}%'.format((xgbr.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for XGBoost Regressor Model: {0:0.2f}%'.format((xgbr.score(X_test_corre, y_test_corre))*100))

y_pred_xgbr = xgbr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_xgbr), kind = 'reg', color = 'brown')
Training model Accuracy value for XGBoost Regressor Model: 97.77%
Testing Model Accuracy value for XGBoost Regressor Model: 81.66%
  • For XGBoost, the training score is coming around 97%, however, with test set, it gives a score of around 83%, which is a sign of overfitting of model.
In [914]:
# Predicting the Test set results

print('Training model Accuracy value for Gradient Boost Regressor Model: {0:0.2f}%'.format((gbr.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for Gradient Boost Regressor Model: {0:0.2f}%'.format((gbr.score(X_test_corre, y_test_corre))*100))

y_pred_gbr = gbr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_gbr), kind = 'reg', color = 'teal')
Training model Accuracy value for Gradient Boost Regressor Model: 86.25%
Testing Model Accuracy value for Gradient Boost Regressor Model: 79.03%

6.e. Various Linear model building and testing with KFold CV

In [915]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

num_folds = 10
seed = 235

kfold = KFold(n_splits = num_folds, random_state = seed)
result_lr = cross_val_score(lr_model, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_lr)
print('\n')
print('Model Score in itteration 2 for Linear Regression:%.3f%% (%.3f%%)'%(result_lr.mean()*100.0, result_lr.std()*100.0))

print('\n')
result_ridge = cross_val_score(ridge, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_ridge)
print('\n')
print('Model Score  in itteration 2 for Ridge:%.3f%% (%.3f%%)'%(result_ridge.mean()*100.0, result_ridge.std()*100.0))

print('\n')
result_lasso = cross_val_score(lasso, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_lasso)
print('\n')
print('Model Score  in itteration 2 for Lasso:%.3f%% (%.3f%%)'%(result_lasso.mean()*100.0, result_lasso.std()*100.0))

print('\n')
result_DTR = cross_val_score(dtr_model, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_DTR)
print('\n')
print('Model Score  in itteration 2 for Desission Tree Regressor:%.3f%% (%.3f%%)'%(result_DTR.mean()*100.0, result_DTR.std()*100.0))

print('\n')
result_clr = cross_val_score(clr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_clr)
print('\n')
print('Model Score  in itteration 2 for Support Vector Regressor:%.3f%% (%.3f%%)'%(result_clr.mean()*100.0, result_clr.std()*100.0))

print('\n')
result_rfr = cross_val_score(rfr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_rfr)
print('\n')
print('Model Score  in itteration 2 for Random Forest Regressor:%.3f%% (%.3f%%)'%(result_rfr.mean()*100.0, result_rfr.std()*100.0))

print('\n')
result_xgbr = cross_val_score(xgbr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_xgbr)
print('\n')
print('Model Score  in itteration 2 for XGBoost Regressor:%.3f%% (%.3f%%)'%(result_xgbr.mean()*100.0, result_xgbr.std()*100.0))

print('\n')
result_gbr = cross_val_score(gbr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_gbr)
print('\n')
print('Model Score  in itteration 2 for Gradient Boosting Regressor:%.3f%% (%.3f%%)'%(result_gbr.mean()*100.0, result_gbr.std()*100.0))
[0.52852254 0.633293   0.67389288 0.50734917 0.56775448 0.62661129
 0.47089744 0.52958011 0.48391658 0.53447629]


Model Score in itteration 2 for Linear Regression:55.563% (6.459%)


[0.52860245 0.63326292 0.67389387 0.50734495 0.5677508  0.62656857
 0.47091128 0.52953951 0.48400378 0.53448209]


Model Score  in itteration 2 for Ridge:55.564% (6.457%)


[0.3889476  0.27002654 0.38395527 0.30562276 0.29245903 0.32761022
 0.09291314 0.31574856 0.26160911 0.30383439]


Model Score  in itteration 2 for Lasso:29.427% (7.810%)


[0.54098708 0.46927227 0.58210943 0.40628945 0.61412656 0.62645254
 0.42595276 0.56212265 0.39750073 0.52776389]


Model Score  in itteration 2 for Desission Tree Regressor:51.526% (8.090%)


[0.68096118 0.76386653 0.84985263 0.65031512 0.70892012 0.79667431
 0.7750984  0.76038143 0.7551561  0.70178081]


Model Score  in itteration 2 for Support Vector Regressor:74.430% (5.612%)


[0.79172685 0.8255707  0.90379613 0.79932751 0.82762708 0.85412202
 0.84951345 0.81388824 0.83421864 0.88755529]


Model Score  in itteration 2 for Random Forest Regressor:83.873% (3.416%)


[0.79214659 0.81798065 0.89225605 0.785936   0.81963155 0.86220363
 0.8544956  0.82732139 0.82148009 0.90417071]


Model Score  in itteration 2 for XGBoost Regressor:83.776% (3.757%)


[0.77538277 0.80302958 0.86439083 0.76252767 0.74749736 0.83570068
 0.81051391 0.7984935  0.77315682 0.80347565]


Model Score  in itteration 2 for Gradient Boosting Regressor:79.742% (3.313%)
In [916]:
mse_i2_1 = np.mean((y_pred_lr_model - y_test_scaled)**2)
rmse_i2_1 = math.sqrt(mse_i2_1)

mse_i2_2 = np.mean((y_pred_ridge - y_test_scaled)**2)
rmse_i2_2 = math.sqrt(mse_i2_2)

mse_i2_3 = np.mean((y_pred_lasso.reshape(309,1) - y_test_p)**2)
rmse_i2_3 = math.sqrt(mse_i2_3)

mse_i2_4 = np.mean((y_pred_dtr_model.reshape(309,1) - y_test_scaled)**2)
rmse_i2_4 = math.sqrt(mse_i2_4)

mse_i2_5 = np.mean((y_pred_clr.reshape(309,1) - y_test_scaled)**2)
rmse_i2_5 = math.sqrt(mse_i2_5)

mse_i2_6 = np.mean((y_pred_rfr.reshape(309,1) -y_test_scaled)**2)
rmse_i2_6 = math.sqrt(mse_i2_6)

mse_i2_7 = np.mean((y_pred_xgbr.reshape(309,1) -y_test_scaled)**2)
rmse_i2_7 = math.sqrt(mse_i2_7)

mse_i2_8 = np.mean((y_pred_gbr.reshape(309,1) -y_test_scaled)**2)
rmse_i2_8 = math.sqrt(mse_i2_8)
In [917]:
result_itteration2 = pd.DataFrame({'Algorithm' : ['Linear Regression', 'Ridge', 'Lasso','Deision Tree', 'Support Vector', 
                                      'Random Forest', 'XGBoost', 'GradientBoost'],
                      'Model_score': [lr_model.score(X_test_corre, y_test_corre)*100.0, ridge.score(X_test_corre, y_test_corre)*100.0,
                                         lasso.score(X_test_corre, y_test_corre)*100.0, dtr_model.score(X_test_corre, y_test_corre)*100.0,
                                         clr.score(X_test_corre, y_test_corre)*100.0, rfr.score(X_test_corre, y_test_corre)*100.0,
                                         xgbr.score(X_test_corre, y_test_corre)*100.0, gbr.score(X_test_corre, y_test_corre)*100.0],
                    
                    'Root Mean Square Error' : [rmse_i2_1, rmse_i2_2, rmse_i2_3, 
                                           rmse_i2_4, rmse_i2_5, rmse_i2_6, 
                                           rmse_i2_7, rmse_i2_8],
                                   
                      'Cross_val_score' : [result_lr.mean()*100.0, result_ridge.mean()*100.0, result_lasso.mean()*100.0, 
                                           result_DTR.mean()*100.0, result_clr.mean()*100.0, result_rfr.mean()*100.0, 
                                           result_xgbr.mean()*100.0, result_gbr.mean()*100.0],
                       
                      'Std_Dev':[result_lr.std()*100.0, result_ridge.std()*100.0, result_lasso.std()*100.0,
                                 result_DTR.std()*100.0, result_clr.std()*100.0, result_rfr.std()*100.0, 
                                 result_xgbr.std()*100.0, result_gbr.std()*100.0]})
result_itteration2
Out[917]:
Algorithm Model_score Root Mean Square Error Cross_val_score Std_Dev
0 Linear Regression 60.347237 0.670540 55.562938 6.458897
1 Ridge 60.336035 0.670653 55.563602 6.456769
2 Lasso 30.019455 0.948238 29.427266 7.810435
3 Deision Tree 52.918005 0.688565 51.525774 8.090367
4 Support Vector 75.984861 0.518199 74.430066 5.612207
5 Random Forest 82.793223 0.414309 83.873459 3.415621
6 XGBoost 81.663114 0.411717 83.776223 3.757386
7 GradientBoost 79.034399 0.466321 79.741688 3.312904

6.f. Observations on Model Score

  • In Linear Regression Analysis, Cost function is the Root mean Square error, and we need to minimize it. The model which gives minimum RMSE value is the suitable model to go with. So here, Random Forest has the lowest RMSE value, and can be chosen for further analysis. However, model score also plays an important role to decide the accurate model.
  • It can be observed that, after addressing outliers for all the features by imputing the respective median value of each column, the score of the models has decreased.

  • However, among all models Radom Forest Repressor has maximum score with train_test_split method,

  • and XGBoost model has also maximum cross validation score with a standard deviation of 3.316.

  • We will try to explore more about the scores and enhance the model performance by creating some composite features and again addressing the outliers present in the features with different number of clusters.

7. Iteration -3: Composite Feature Creation, Generation of Cluster, Outlier Detection & Treatment, Model Building & Testing

7.a. Composite Feature Creation

In [918]:
df_attr = df.iloc[:, 0:10]
df_attr['cor_fine_agg'] = df_attr['coarseagg'] + df_attr['fineagg']
df_attr['ash_slag'] = df_attr['ash'] + df_attr['slag']
In [919]:
sns.scatterplot(df_attr.ash_slag, df_attr.strength,color = 'brown');
plt.title('Variation of strength with respect to both ash and slag', color = 'brown');
In [920]:
sns.scatterplot(df_attr.cor_fine_agg, df_attr.strength,color = 'purple');
plt.title('Variation of strength with respect to both fine and coarse aggeregate', color = 'brown');
In [921]:
corr_matrix = df_attr.corr()
corr_matrix['strength'].sort_values(ascending = False)
Out[921]:
strength        1.000000
cement          0.497832
superplastic    0.366079
age             0.328873
slag            0.134829
ash_slag        0.054507
ash            -0.105755
coarseagg      -0.164935
fineagg        -0.167241
cor_fine_agg   -0.259130
water          -0.289633
Name: strength, dtype: float64
In [922]:
df_attr = df_attr.apply(zscore)

7.b. Generation of Cluster and cluster error determination

In [923]:
cluster_range = range (3,12)
cluster_errors = []
for num_clusters in cluster_range:
    clusters = KMeans (num_clusters, n_init = 5)
    clusters.fit(df_attr)
    labels = clusters.labels_
    centroids = clusters.cluster_centers_
    cluster_errors.append(clusters.inertia_)
clusters_df_attr = pd.DataFrame({'num_clusters':cluster_range, 'cluster_errors': cluster_errors})
clusters_df_attr[0:30]
Out[923]:
num_clusters cluster_errors
0 3 7794.249536
1 4 6654.025773
2 5 5824.381927
3 6 5219.969152
4 7 4892.345033
5 8 4637.918717
6 9 4344.915462
7 10 4132.931803
8 11 3883.332358
In [924]:
# Elbow Plot

plt.figure(figsize = (12,6))
plt.plot(clusters_df_attr.num_clusters, clusters_df_attr.cluster_errors, marker = 'o');
  • The elbow plot confirms our visual analysis that there are likely 7 or 8 clusters.
  • Let us start with 6 clusters
In [925]:
cluster = KMeans(n_clusters = 6, random_state = 2535)
cluster.fit(df_attr)

prediction = cluster.predict(df_attr)
# Creating a new column 'GROUP' which will hold the cluster id of each record
df_attr['GROUP'] = prediction

# Creating a mirror copy for later re-use instead of building repeatedly 
df_attr_copy = df_attr.copy(deep = True)
  • let's see the centroids of these clusters
In [926]:
centroids = cluster.cluster_centers_
centroids
Out[926]:
array([[ 0.64389474, -0.85389926, -0.83755442,  0.40643963, -0.99899994,
         0.64947697,  0.06576272, -0.1407015 , -0.45837557,  0.55091505,
        -1.42597624],
       [-0.31663861,  0.21495013,  1.10546986,  0.53859061,  0.47538935,
        -1.23845118, -0.44296501, -0.29389134, -0.09389414, -1.30201389,
         1.00042557],
       [ 1.04640316,  0.46704713, -0.55724134, -0.99813053,  1.1358355 ,
        -0.64014308,  0.2697471 , -0.17801788,  1.2016532 , -0.27802996,
         0.05192725],
       [-0.81098426,  1.39463645, -0.79894863,  0.51984884, -0.68064463,
        -0.02681648, -0.23385544, -0.25068167, -0.52813078, -0.20580468,
         0.77528833],
       [ 0.70615294, -0.08941103, -0.84714393,  1.72076737, -1.03914281,
        -0.16825719, -1.42174666,  2.90474577,  0.59141048, -1.2552215 ,
        -0.69384712],
       [-0.59939214, -0.64372482,  1.07069016, -0.57715316,  0.37883586,
         0.58215052,  0.4333687 , -0.12217789, -0.20358278,  0.7903404 ,
         0.14543585]])
  • Instead of interpreting the neumerical values of the centroids, let us do a visual analysis by converting the centroids and the data in the cluster into box plots.
In [927]:
df_attr.boxplot(by = 'GROUP', layout = (3,4), figsize = (15,20), color = 'orange');

Observations:

  • Many outliers can be seen on each dimension (indicated by the black circle)
  • Spread of data on each dimension (indicated by the whiskers is long....due to the outliers)
  • If the outliers are addressed, the clusters will overlap much

7.c. Detection of Outliers and Treatment

In [928]:
# Addressing the outliers at group level

def replace(group):
    median, std = group.median(), group.std()  # Get the median and the standard deviation of every group
    outliers = (group - median).abs()> 2*std  # Subtract median from every member of each group. Take absolute values > 2 std. dev
    group[outliers] = group.median()
    return group

df_corrected = (df_attr.groupby('GROUP').transform(replace))
concat_data = df_corrected.join(pd.DataFrame(df_attr['GROUP']))
In [929]:
concat_data.boxplot(by = 'GROUP', layout = (3,4), figsize = (15,20), color = 'blue');

Observations:

  • NOTE: When we remove outliers and replace with median or mean, the distribution shape changes, the standard deviation becomes tighter creating new outliers. The new outliers would be much closer to the centre than original outliers so we accept them without modifying them.

  • By replacing outliers, forcefully we created some new outliers by making the boundary much tighter.

Let us analyze the target column vs other columns group wise

In [930]:
with sns.axes_style('white'):
    plot = sns.lmplot('cement', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Cement vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('ash', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Ash vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('water', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Water vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('slag', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Slag vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('age', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Age vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('fineagg', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Fine Aggregate vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('coarseagg', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Coarse Aggregate vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('superplastic', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('Super Plastic vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('cor_fine_agg', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('cor_fine_agg vs. Strength', color = 'brown');


with sns.axes_style('white'):
    plot = sns.lmplot('ash_slag', 'strength', data = concat_data, hue = 'GROUP')
plot.set(ylim = (-3,3));
plt.title('ash_slag vs. Strength', color = 'brown');

Observations:

  • The above plots show the relation of target column with independent columns in a cluster wise or group wise format.

  • Here, we have made 6 clusters based upon the elbow plot.

  • And, for each independent variables, group of clusters were plotted with their respective best fit lines.

  • The more horizontal a line is, the more week it is in predicting the target column. It means, a horizontal line will only take one value of target column and fix it. So the spread of all data points can't be captured by this line.

  • We can observe from above plots that, in each of the cluster, data are far widen or spread away from the best fit line signifying too large variance in the cluster.

7.d. Modeling Building & Testing using Train_Test_Split method

In [931]:
df_corrected.head()
Out[931]:
cement slag ash water superplastic coarseagg fineagg age strength cor_fine_agg ash_slag
0 -1.339017 1.601441 -0.847144 1.027590 -1.039143 -0.014398 -0.312970 -0.279733 -0.355018 -0.258923 0.940642
1 -1.074790 -0.367541 1.096078 -1.090116 0.769617 1.388141 0.282260 -0.501465 -0.737108 1.289709 0.430617
2 -0.298384 -0.856888 0.648965 0.273274 -0.118015 -0.206121 1.093371 -0.279733 -0.395144 0.707613 -0.363006
3 -0.145209 0.465044 -0.847144 0.488793 -1.039143 -0.526517 -1.292542 -0.279733 0.600806 -1.428057 -0.157875
4 -1.209776 1.269798 -0.847144 0.549700 -1.039143 0.958372 -0.959363 -0.279733 -1.049727 -0.023713 0.620055
In [932]:
X_scaled_corre = df_corrected.drop('strength', axis =1)
y_scaled_corre = df_corrected[['strength']]
In [933]:
# Split X and y into training and test set in 70:30 ratio

X_train_corre, X_test_corre, y_train_corre, y_test_corre = train_test_split(X_scaled_corre, y_scaled_corre, test_size=0.30, random_state=1)
In [934]:
lr_model.fit(X_train_corre, y_train_corre)
ridge.fit(X_train_corre,y_train_corre)
lasso.fit(X_train_corre,y_train_corre)
dtr_model.fit(X_train_corre,y_train_corre)
clr.fit(X_train_corre,y_train_corre)
rfr = rfr.fit(X_train_corre,y_train_corre)
xgbr.fit(X_train_corre,y_train_corre)
gbr.fit(X_train_corre,y_train_corre)
Out[934]:
GradientBoostingRegressor(alpha=0.9, ccp_alpha=0.0, criterion='friedman_mse',
                          init=None, learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=8, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=50,
                          n_iter_no_change=None, presort='deprecated',
                          random_state=559, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)
In [935]:
# Predicting the Test set results

print('Training model Accuracy value for Linear Regressor Model: {0:0.2f}%'.format((lr_model.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for Linear Regressor Model: {0:0.2f}%'.format((lr_model.score(X_test_corre, y_test_corre))*100))

y_pred_lr_model = lr_model.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_lr_model), kind = 'reg', color = 'green')
Training model Accuracy value for Linear Regressor Model: 54.35%
Testing Model Accuracy value for Linear Regressor Model: 56.61%
In [936]:
# Predicting the Test set results

print('Training model Accuracy value for Ridge Regressor Model: {0:0.2f}%'.format((ridge.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for Ridge Regressor Model: {0:0.2f}%'.format((ridge.score(X_test_corre, y_test_corre))*100))

y_pred_ridge = ridge.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_ridge), kind = 'reg', color = 'red')
Training model Accuracy value for Ridge Regressor Model: 54.35%
Testing Model Accuracy value for Ridge Regressor Model: 56.61%
In [937]:
# Predicting the Test set results

print('Training model Accuracy value for Lasso Regressor Model: {0:0.2f}%'.format((lasso.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for Lasso Regressor Model: {0:0.2f}%'.format((lasso.score(X_test_corre, y_test_corre))*100))

y_pred_lasso = lasso.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_lasso), kind = 'reg', color = 'purple')
Training model Accuracy value for Lasso Regressor Model: 27.37%
Testing Model Accuracy value for Lasso Regressor Model: 25.88%
In [938]:
# Predicting the Test set results

print('Training model Accuracy value for Decision Tree Regressor Model: {0:0.2f}%'.format((dtr_model.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for Decision Tree Regressor Model: {0:0.2f}%'.format((dtr_model.score(X_test_corre, y_test_corre))*100))

y_pred_dtr_model = dtr_model.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_dtr_model), kind = 'reg', color = 'blue')
Training model Accuracy value for Decision Tree Regressor Model: 62.21%
Testing Model Accuracy value for Decision Tree Regressor Model: 54.50%
In [939]:
# Predicting the Test set results

print('Training model Accuracy value for Support vector Regressor Model: {0:0.2f}%'.format((clr.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for Support vector Regressor Model: {0:0.2f}%'.format((clr.score(X_test_corre, y_test_corre))*100))

y_pred_clr = clr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_clr), kind = 'reg', color = 'brown')
Training model Accuracy value for Support vector Regressor Model: 79.75%
Testing Model Accuracy value for Support vector Regressor Model: 72.14%
In [940]:
# Predicting the Test set results

print('Training model Accuracy value for Random Forest Regressor Model: {0:0.2f}%'.format((rfr.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for Random Forest Regressor Model: {0:0.2f}%'.format((rfr.score(X_test_corre, y_test_corre))*100))

y_pred_rfr = rfr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_rfr), kind = 'reg', color = 'teal')
Training model Accuracy value for Random Forest Regressor Model: 96.06%
Testing Model Accuracy value for Random Forest Regressor Model: 83.62%
  • For Random forest, the training score is coming around 96%, however, with test set, it gives a score of around 85%, which is a sign of overfitting of model.
In [941]:
# Predicting the Test set results

print('Training model Accuracy value for XGBoost Regressor Model: {0:0.2f}%'.format((xgbr.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for XGBoost Regressor Model: {0:0.2f}%'.format((xgbr.score(X_test_corre, y_test_corre))*100))

y_pred_xgbr = xgbr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_xgbr), kind = 'reg', color = 'orange')
Training model Accuracy value for XGBoost Regressor Model: 97.54%
Testing Model Accuracy value for XGBoost Regressor Model: 82.23%
  • For XGBoost, the training score is coming around 97%, however, with test set, it gives a score of around 82%, which is a sign of overfitting of model.
In [942]:
# Predicting the Test set results

print('Training model Accuracy value for Gradient Boost Regressor Model: {0:0.2f}%'.format((gbr.score(X_train_corre,y_train_corre))*100))

print('Testing Model Accuracy value for Gradient Boost Regressor Model: {0:0.2f}%'.format((gbr.score(X_test_corre, y_test_corre))*100))

y_pred_gbr = gbr.predict(X_test_corre)
sns.set(style = 'darkgrid', color_codes = True)

with sns.axes_style('white'):
    sns.jointplot(x = y_test_corre, y = pd.DataFrame(y_pred_gbr), kind = 'reg', color = 'brown')
Training model Accuracy value for Gradient Boost Regressor Model: 86.63%
Testing Model Accuracy value for Gradient Boost Regressor Model: 79.51%
In [943]:
mse_i3_1 = np.mean((y_pred_lr_model - y_test_scaled)**2)
rmse_i3_1 = math.sqrt(mse_i3_1)

mse_i3_2 = np.mean((y_pred_ridge - y_test_scaled)**2)
rmse_i3_2 = math.sqrt(mse_i3_2)

mse_i3_3 = np.mean((y_pred_lasso.reshape(309,1) - y_test_p)**2)
rmse_i3_3 = math.sqrt(mse_i3_3)

mse_i3_4 = np.mean((y_pred_dtr_model.reshape(309,1) - y_test_scaled)**2)
rmse_i3_4 = math.sqrt(mse_i3_4)

mse_i3_5 = np.mean((y_pred_clr.reshape(309,1) - y_test_scaled)**2)
rmse_i3_5 = math.sqrt(mse_i3_5)

mse_i3_6 = np.mean((y_pred_rfr.reshape(309,1) -y_test_scaled)**2)
rmse_i3_6 = math.sqrt(mse_i3_6)

mse_i3_7 = np.mean((y_pred_xgbr.reshape(309,1) -y_test_scaled)**2)
rmse_i3_7 = math.sqrt(mse_i3_7)

mse_i3_8 = np.mean((y_pred_gbr.reshape(309,1) -y_test_scaled)**2)
rmse_i3_8 = math.sqrt(mse_i3_8)

7.e. Modeling Building & Testing using KFold CV

In [944]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

num_folds = 10
seed = 235

kfold = KFold(n_splits = num_folds, random_state = seed)
result_lr = cross_val_score(lr_model, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_lr)
print('\n')
print('Model Score in itteration 2 for Linear Regression:%.3f%% (%.3f%%)'%(result_lr.mean()*100.0, result_lr.std()*100.0))

print('\n')
result_ridge = cross_val_score(ridge, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_ridge)
print('\n')
print('Model Score  in itteration 2 for Ridge:%.3f%% (%.3f%%)'%(result_ridge.mean()*100.0, result_ridge.std()*100.0))

print('\n')
result_lasso = cross_val_score(lasso, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_lasso)
print('\n')
print('Model Score  in itteration 2 for Lasso:%.3f%% (%.3f%%)'%(result_lasso.mean()*100.0, result_lasso.std()*100.0))

print('\n')
result_DTR = cross_val_score(dtr_model, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_DTR)
print('\n')
print('Model Score  in itteration 2 for Desission Tree Regressor:%.3f%% (%.3f%%)'%(result_DTR.mean()*100.0, result_DTR.std()*100.0))

print('\n')
result_clr = cross_val_score(clr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_clr)
print('\n')
print('Model Score  in itteration 2 for Support Vector Regressor:%.3f%% (%.3f%%)'%(result_clr.mean()*100.0, result_clr.std()*100.0))

print('\n')
result_rfr = cross_val_score(rfr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_rfr)
print('\n')
print('Model Score  in itteration 2 for Random Forest Regressor:%.3f%% (%.3f%%)'%(result_rfr.mean()*100.0, result_rfr.std()*100.0))

print('\n')
result_xgbr = cross_val_score(xgbr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_xgbr)
print('\n')
print('Model Score  in itteration 2 for XGBoost Regressor:%.3f%% (%.3f%%)'%(result_xgbr.mean()*100.0, result_xgbr.std()*100.0))

print('\n')
result_gbr = cross_val_score(gbr, X_scaled_corre, y_scaled_corre, cv = kfold)
print(result_gbr)
print('\n')
print('Model Score  in itteration 2 for Gradient Boosting Regressor:%.3f%% (%.3f%%)'%(result_gbr.mean()*100.0, result_gbr.std()*100.0))
[0.40556568 0.6336184  0.61370546 0.49849469 0.51109633 0.63127848
 0.44755984 0.48874452 0.47407289 0.52263903]


Model Score in itteration 2 for Linear Regression:52.268% (7.482%)


[0.40569166 0.63357086 0.61371101 0.49852536 0.51105895 0.63126003
 0.44752763 0.48881376 0.47415038 0.5226717 ]


Model Score  in itteration 2 for Ridge:52.270% (7.478%)


[0.28017746 0.24116302 0.30737279 0.23391971 0.2387905  0.29450529
 0.03882279 0.27282197 0.2779722  0.26179282]


Model Score  in itteration 2 for Lasso:24.473% (7.239%)


[0.54229289 0.5924053  0.53367243 0.46407772 0.60991691 0.62430938
 0.55914426 0.62933133 0.51892832 0.58936896]


Model Score  in itteration 2 for Desission Tree Regressor:56.634% (4.984%)


[0.55875337 0.81256267 0.8249003  0.67933844 0.70845101 0.81285916
 0.74896874 0.78628104 0.71528369 0.68163555]


Model Score  in itteration 2 for Support Vector Regressor:73.290% (7.813%)


[0.76497507 0.87040402 0.87132334 0.82384879 0.85258661 0.89485503
 0.78597613 0.84227352 0.85625441 0.87279668]


Model Score  in itteration 2 for Random Forest Regressor:84.353% (3.886%)


[0.7705129  0.84387237 0.88349232 0.78166851 0.84767264 0.86213374
 0.77830545 0.85415031 0.84519634 0.88566982]


Model Score  in itteration 2 for XGBoost Regressor:83.527% (4.072%)


[0.72706855 0.83394185 0.84203257 0.78115652 0.77791998 0.8306237
 0.79927261 0.80692833 0.77690586 0.7674624 ]


Model Score  in itteration 2 for Gradient Boosting Regressor:79.433% (3.367%)

7.f. Observations on Model Score

In [945]:
result_itteration3 = pd.DataFrame({'Algorithm' : ['Linear Regression', 'Ridge', 'Lasso','Deision Tree', 'Support Vector', 
                                      'Random Forest', 'XGBoost', 'GradientBoost'],
                      'Model_score': [lr_model.score(X_test_corre, y_test_corre)*100.0, ridge.score(X_test_corre, y_test_corre)*100.0,
                                         lasso.score(X_test_corre, y_test_corre)*100.0, dtr_model.score(X_test_corre, y_test_corre)*100.0,
                                         clr.score(X_test_corre, y_test_corre)*100.0, rfr.score(X_test_corre, y_test_corre)*100.0,
                                         xgbr.score(X_test_corre, y_test_corre)*100.0, gbr.score(X_test_corre, y_test_corre)*100.0],
                                   
                     'Root Mean Square Error' : [rmse_i3_1, rmse_i3_2, rmse_i3_3, 
                                           rmse_i3_4, rmse_i3_5, rmse_i3_6, 
                                           rmse_i3_7, rmse_i3_8],
                       
                      'Cross_val_score' : [result_lr.mean()*100.0, result_ridge.mean()*100.0, result_lasso.mean()*100.0, 
                                           result_DTR.mean()*100.0, result_clr.mean()*100.0, result_rfr.mean()*100.0, 
                                           result_xgbr.mean()*100.0, result_gbr.mean()*100.0],
                       
                      'Std_Dev':[result_lr.std()*100.0, result_ridge.std()*100.0, result_lasso.std()*100.0,
                                 result_DTR.std()*100.0, result_clr.std()*100.0, result_rfr.std()*100.0, 
                                 result_xgbr.std()*100.0, result_gbr.std()*100.0]})
result_itteration3
Out[945]:
Algorithm Model_score Root Mean Square Error Cross_val_score Std_Dev
0 Linear Regression 56.613860 0.698418 52.267753 7.481897
1 Ridge 56.608356 0.698479 52.269813 7.478485
2 Lasso 25.878016 0.944048 24.473386 7.239213
3 Deision Tree 54.498610 0.704209 56.634475 4.983943
4 Support Vector 72.140254 0.553305 73.290340 7.813029
5 Random Forest 83.620314 0.465409 84.352936 3.886086
6 XGBoost 82.231836 0.470004 83.526744 4.072049
7 GradientBoost 79.507972 0.495998 79.433124 3.367108

Observations:

  • In Linear Regression Analysis, Cost function is the Root mean Square error, and we need to minimize it. The model which gives minimum RMSE value is the suitable model to go with. So here, Random Forest has the lowest RMSE value, and can be chosen for further analysis. However, model score also plays an important role to decide the accurate model.

  • It can be observed that, after generating two composite features and addressing outliers for all the features by imputing the respective median value of each column, the score of the models has not increased, rather it got affected further.

  • However, among all models Random Forest Repressor has maximum score with train_test_split method,

  • And it has also maximum cross validation score with a standard deviation of 3.886.

8. Iteration -4: Model Tuning using GridSearch CV and RandomSearch CV

8.a. Best Model suitbale for this Project

Observations:

  • Based upon the RMSE value, Model prediction score and CV score, out of non-regularised model, regularised model and other regression model, only Random Forest Repressor and XGBoost repressor gave a decent result. And among all three iterations, we got a good score in iteration 1.

  • Thus we will go with the result of 1st iteration and the processes we have followed up for this.

  • Now we will further try to explore more on the model performance by considering only two best models whose mean scores are nearly equal and also highest among all.

  • In this iteration, we will tuning the model by following feature engineering process.

  • We will take XGBoost and Random Forest Regression models for model tuning and will check the score.

  • Model tuning, here we are going to use Grid Search CV and Randomized Search CV.

8.b. Model Tuning using Grid Search CV

Random Forest Regressor:

In [946]:
from sklearn.model_selection import GridSearchCV

parameters_rfr = {'bootstrap':[True],
             'max_depth': [10,20,30,40,50],
             'max_features': ['auto', 'sqrt'],
             'min_samples_leaf': [1,2,4,8],
             'n_estimators': [100]}

clf = GridSearchCV(RandomForestRegressor(), parameters_rfr, cv = 5, verbose = 2, n_jobs=4)
clf.fit(X_scaled, y_scaled)

clf.best_params_
Fitting 5 folds for each of 40 candidates, totalling 200 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:  1.0min
[Parallel(n_jobs=4)]: Done 154 tasks      | elapsed:  1.6min
[Parallel(n_jobs=4)]: Done 200 out of 200 | elapsed:  1.9min finished
Out[946]:
{'bootstrap': True,
 'max_depth': 30,
 'max_features': 'auto',
 'min_samples_leaf': 1,
 'n_estimators': 100}
In [965]:
rfr_gdcv = RandomForestRegressor(bootstrap=True,
max_depth=30,
max_features= 'auto',
min_samples_leaf= 1,
n_estimators=100)

rfr_gdcv.fit(X_train_scaled, y_train_scaled)

rfr_gdcv_score = cross_val_score(rfr_gdcv, X_test_scaled, y_test_scaled, cv=5)

print('Model Score in itteration 4 for Random Forest Regressor using GridSearch CV:%.3f%% (%.3f%%)'%(rfr_gdcv_score.mean()*100.0, rfr_gdcv_score.std()*100.0))
Model Score in itteration 4 for Random Forest Regressor using GridSearch CV:84.428% (3.314%)

XGBoost Regressor:

In [948]:
parameters_xgbr = {'gamma':[0.1,0.2,0.3,0.4],
             'max_depth':[3,None],
             'min_child_weight':[0,3,4,9],
              'num_parallel_tree': [0,4,6,9],
              'colsample_bylevel': [0,3,4,8],
             'colsample_bynode': [0,3,7,9],
             'colsample_bytree': [0,2,4,6],
                   'n_estimators':[50,80,100]}


gscv_xgbr = GridSearchCV(xgbr, parameters_xgbr, cv = 5, verbose = 2, n_jobs=4)
gscv_xgbr.fit(X_train_scaled, y_train_scaled)

gscv_xgbr.best_params_
Fitting 5 folds for each of 24576 candidates, totalling 122880 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:   18.0s
[Parallel(n_jobs=4)]: Done 154 tasks      | elapsed:   26.4s
[Parallel(n_jobs=4)]: Done 357 tasks      | elapsed:   50.0s
[Parallel(n_jobs=4)]: Done 640 tasks      | elapsed:  1.3min
[Parallel(n_jobs=4)]: Done 1005 tasks      | elapsed:  2.0min
[Parallel(n_jobs=4)]: Done 1450 tasks      | elapsed:  2.8min
[Parallel(n_jobs=4)]: Done 2128 tasks      | elapsed:  3.6min
[Parallel(n_jobs=4)]: Done 6984 tasks      | elapsed:  4.0min
[Parallel(n_jobs=4)]: Done 12496 tasks      | elapsed:  4.6min
[Parallel(n_jobs=4)]: Done 18648 tasks      | elapsed:  5.2min
[Parallel(n_jobs=4)]: Done 25456 tasks      | elapsed:  5.9min
[Parallel(n_jobs=4)]: Done 32904 tasks      | elapsed:  6.7min
[Parallel(n_jobs=4)]: Done 41008 tasks      | elapsed:  7.5min
[Parallel(n_jobs=4)]: Done 49752 tasks      | elapsed:  8.4min
[Parallel(n_jobs=4)]: Done 59152 tasks      | elapsed:  9.4min
[Parallel(n_jobs=4)]: Done 69192 tasks      | elapsed: 10.5min
[Parallel(n_jobs=4)]: Done 79888 tasks      | elapsed: 11.5min
[Parallel(n_jobs=4)]: Done 91224 tasks      | elapsed: 12.7min
[Parallel(n_jobs=4)]: Done 103216 tasks      | elapsed: 14.0min
[Parallel(n_jobs=4)]: Done 115848 tasks      | elapsed: 15.2min
[Parallel(n_jobs=4)]: Done 122880 out of 122880 | elapsed: 15.9min finished
Out[948]:
{'colsample_bylevel': 0,
 'colsample_bynode': 0,
 'colsample_bytree': 0,
 'gamma': 0.1,
 'max_depth': 3,
 'min_child_weight': 0,
 'n_estimators': 100,
 'num_parallel_tree': 9}
In [949]:
xgbr_gscv = xgboost.XGBRegressor(colsample_bylevel=0,
 colsample_bynode= 0,
 colsample_bytree= 0,
 gamma= 0.1,
 max_depth= 3,
 min_child_weight= 0,
 n_estimators= 100,
 num_parallel_tree= 9)

xgbr_gscv.fit(X_train_scaled, y_train_scaled)

xgbr_gscv_score = cross_val_score(xgbr_gscv, X_test_scaled, y_test_scaled, cv=5)

print('Model Score  in itteration 4 for XGBoost Regressor using RandomSearch CV:%.3f%% (%.3f%%)'%(xgbr_gscv_score.mean()*100.0, xgbr_gscv_score.std()*100.0))
Model Score  in itteration 4 for XGBoost Regressor using RandomSearch CV:81.877% (2.613%)

8.c. Model Tuning using Randomized Search CV

Random Forest Regressor:

In [950]:
parameter_rfr_rcv = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              }
In [951]:
# run randomized search
from sklearn.model_selection import RandomizedSearchCV
samples = 10    # number of random samples
rscv =  RandomizedSearchCV(rfr, param_distributions =parameter_rfr_rcv, n_iter=samples)
In [952]:
rscv.fit(X_train_scaled, y_train_scaled)
print(rscv.best_params_)
{'bootstrap': True, 'max_depth': None, 'max_features': 8, 'min_samples_leaf': 2, 'min_samples_split': 3}
In [953]:
rfr_rscv = RandomForestRegressor(bootstrap= False, max_depth= None, 
                                 max_features= 2, min_samples_leaf= 1, 
                                 min_samples_split= 4)

rfr_rscv.fit(X_train_scaled, y_train_scaled)

rfr_rscv_score = cross_val_score(rfr_rscv, X_test_scaled, y_test_scaled, cv=5)

print('Model Score  in itteration 4 for Random Forest Regressor using RandomSearch CV:%.3f%% (%.3f%%)'%(rfr_rscv_score.mean()*100.0, rfr_rscv_score.std()*100.0))
Model Score  in itteration 4 for Random Forest Regressor using RandomSearch CV:82.035% (4.088%)

XGBoost Regressor:

In [954]:
parameters_xgbr_rscv = {'gamma':[0.1],
             'max_depth':sp_randint(1, 11),
             'min_child_weight': sp_randint(1, 11),
              'num_parallel_tree': sp_randint(0, 2),
              'colsample_bylevel': sp_randint(0, 2),
              'colsample_bynode': sp_randint(0, 2),
              'colsample_bytree': sp_randint(0, 5),
             'n_estimators': [100]}


xgbr_rscv = RandomizedSearchCV(xgbr, param_distributions = parameters_xgbr_rscv, cv = 5, verbose = 2, n_jobs=4)
xgbr_rscv.fit(X_train_scaled, y_train_scaled)

xgbr_rscv.best_params_
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  50 out of  50 | elapsed:    1.1s finished
Out[954]:
{'colsample_bylevel': 0,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0.1,
 'max_depth': 9,
 'min_child_weight': 7,
 'n_estimators': 100,
 'num_parallel_tree': 1}
In [966]:
xgbr_rscv = xgboost.XGBRegressor(colsample_bylevel= 0,
 colsample_bynode= 0,
 colsample_bytree= 1,
 gamma= 0.1,
 max_depth= 9,
 min_child_weight= 7,
 n_estimators=100,
 num_parallel_tree= 1)

xgbr_rscv.fit(X_train_scaled, y_train_scaled)

xgbr_rscv_score = cross_val_score(xgbr_rscv, X_test_scaled, y_test_scaled, cv=5)

print('Model Score  in itteration 4 for XGBoost Regressor using RandomSearch CV:%.3f%% (%.3f%%)'%(xgbr_rscv_score.mean()*100.0, xgbr_rscv_score.std()*100.0))
Model Score  in itteration 4 for XGBoost Regressor using RandomSearch CV:85.684% (3.371%)

8.d. Model Performance Range at 95% Confidence Interval

In [956]:
# configure bootstrap
n_iterations = 1000 # Number of BootStrap samples to creat
n_size = int(len(df_scaled)*0.75)   # Picking only 75% of the given data in every bootstrap sample
values = df_scaled.values

# run bootstrap
stats = list()
for i in range (n_iterations):
    #     prepare train and test sets
    train = resample(values, n_samples = n_size)   # sampling with replacement
    test = np.array([x for x in values if x.tolist() not in train. tolist()])  # picking rest of the data not considered in sample
    # fit the model
    model = xgboost.XGBRegressor()
    model.fit(train[:,:-1], train[:,-1])
    # Evaluate the model
    predictions = model.predict(test[:,:-1])
    score_test = model.score(test[:,:-1], test[:,-1])
    print()
    print(score_test)
    stats.append(score_test)
0.8987743477325063

0.9258154947580562

0.8935204435614589

0.92835261338337

0.9422305674757511

0.9237281660427765

0.859614664540431

0.931070482903789

0.9097575233526627

0.9279019245408371

0.9403754487149523

0.8687812560894651

0.9051823590146456

0.9322794615507781

0.8964106406845651

0.9433946799089508

0.9191599304656314

0.843947203382712

0.9357958916834901

0.865056956441831

0.9253462600405121

0.9056510718131731

0.9000591863030423

0.9225628754593825

0.9154645565199376

0.9036358943153139

0.947476318543235

0.9153483219041001

0.9115224693519284

0.8883209072196956

0.8771514033863884

0.9061242385202248

0.9274642915130353

0.8661894871568722

0.8508236002390788

0.9045407054716724

0.9169176706063069

0.9457018209686474

0.87734732984255

0.899017700425084

0.9055660624780877

0.9290507108870548

0.9351133857370026

0.9461592442469513

0.9356304504318963

0.9455590460879595

0.9059315397878321

0.9116847681328654

0.9237358111531673

0.8998982589386635

0.9102876077907355

0.9624341514996193

0.9191985382737597

0.9329783541733488

0.8579386698155033

0.8592218845639854

0.9251978978095641

0.9153531830408483

0.9073801757465352

0.8671562174220048

0.9253583241290009

0.8898637953937162

0.9132410974780508

0.9225516792470058

0.896925241659498

0.9047843311293766

0.9375857350969231

0.8935569202539585

0.9287476969646354

0.8631417944337417

0.9329063849241749

0.9247249361106059

0.8957455152565114

0.8782733417600628

0.898599330023885

0.9444140154355523

0.8683847980543353

0.9214817757477557

0.9218066889420317

0.8607760533993795

0.9039160180032607

0.9415868020144724

0.8656123785842927

0.9029477475075915

0.9279289821164911

0.8904146166749538

0.909803422298688

0.9115951298884333

0.8944746482555566

0.9010803626899774

0.8295927870515776

0.9193985392692904

0.8781412117683621

0.8965563843687626

0.9520956446639524

0.8774491831655968

0.9481604236376575

0.9168794795640971

0.906048740670423

0.9046277913397395

0.9017542835053187

0.8596242090855613

0.9367318429987976

0.9007217457595517

0.8643089137082673

0.9348138641549502

0.8967083618972149

0.9170548801715351

0.904419825174844

0.9339091816567578

0.9189966025096881

0.8750947563646484

0.9035680291558287

0.8855704717046907

0.9296224146904772

0.9182142843913073

0.8414607116932897

0.9042008713461636

0.9437321266682335

0.9400935149830727

0.8987260279145306

0.9168915496139034

0.8795126307264491

0.9305269689722482

0.9347844356448323

0.8598427158911712

0.9268774086009965

0.8743666233218408

0.9352641712890366

0.9138529968695827

0.8986263401784379

0.8953281260412065

0.8560094414951671

0.9210898537151803

0.9136236061360591

0.9346347846755139

0.9152793230986909

0.9199346733109407

0.9311056395309623

0.901898395420096

0.8748357776960718

0.9032278822401228

0.9165027848366097

0.9084775881850343

0.9318862277460473

0.8736261320304587

0.886425700871097

0.9085431555618456

0.8919011971990766

0.8879631214282635

0.8674937906681434

0.9396790783666702

0.929710422384272

0.9343275840224823

0.9329033951436705

0.9130896624800195

0.9071085940973401

0.9135682582064754

0.9301142463750501

0.8793523114604028

0.901172758205638

0.9147738545773008

0.936822546883828

0.9250663504220323

0.891920873254665

0.9505921951179358

0.9425184431998832

0.9277186128621788

0.8853600790774819

0.9305677913506858

0.9157992977715722

0.9498920384633758

0.8557298536047588

0.9435439288185944

0.9152030188454582

0.9408413367780784

0.8768660211384979

0.853185895804462

0.893369502309984

0.9103922738395936

0.9312126208380134

0.9251566933746146

0.8894988207326313

0.9119685177611088

0.9376041383218916

0.9309974333621703

0.8586953012206868

0.9381591711040775

0.8736014534316909

0.9254590854346794

0.9299351233377714

0.9172571534322678

0.8839862336379128

0.9225735011369177

0.8923703957005963

0.9078804683381567

0.888770641946464

0.9238986957071945

0.9041836746956213

0.9161968356896121

0.9086279697876476

0.9066609315765805

0.9351107839020135

0.9418890271288677

0.9322452841082712

0.904641385221942

0.9267997841240135

0.9169160543713335

0.905551319372371

0.8940663824379176

0.8975223225141881

0.9247791162430844

0.9100187864612099

0.919965902220059

0.937280581279557

0.9380492272418248

0.9125748233547641

0.8943762777597335

0.9292081748924294

0.8840043861642858

0.9407554950397198

0.8298372546972848

0.9008945217750121

0.9092029308784414

0.9049398076577716

0.9173730304463495

0.9421191142981772

0.952404607992447

0.8674677208960851

0.923040613129835

0.890817473451617

0.9085671050519053

0.9087029820436101

0.9160675831115531

0.9035196410049131

0.9153877738624041

0.8435887641864512

0.9649476606927582

0.9216352328410987

0.9016679554559474

0.8780605016119081

0.9103012174045909

0.9022492006099436

0.9328686684922978

0.8099589230866434

0.9307249834383375

0.9515964705403405

0.9433801536216132

0.9044574696874531

0.9162601443407551

0.9177848769278808

0.9393966264954869

0.9188004644206981

0.9342799386568479

0.9242363701449163

0.9285058917389564

0.9041345221665478

0.8978997032484285

0.8931735374024554

0.7936448532490817

0.9344040230443734

0.9538618158582444

0.9201705490531357

0.9095638177348013

0.9103429792166856

0.8997350860338968

0.8969027060313065

0.9180780664591357

0.9224405851862756

0.8924946239864897

0.9030174899971106

0.9276669871627771

0.8848342440366115

0.8887753076501546

0.9181508090091953

0.9058957437717857

0.9278489348382549

0.9211078282906258

0.8853469577842639

0.8684141103467211

0.8957065250604882

0.9346777828717655

0.9420280960708614

0.9416895501447978

0.9239467360941411

0.9080222400953886

0.9226045939663658

0.9183112241830307

0.9152852445010907

0.9053142168570577

0.9000561076404926

0.9189862077644393

0.9347113829894106

0.9064185222183037

0.9003581035122218

0.9271900230016392

0.9195422356076176

0.9138811495760307

0.9404848710742477

0.8917374042639705

0.9206768048371432

0.9002705114507723

0.9099999822189955

0.8671864457230419

0.8806990823355528

0.9372101769093981

0.9113882824343964

0.9136913644686256

0.8894530662069303

0.9480277772055429

0.9137704238354815

0.9463626513054789

0.9199663505323228

0.9065148455313157

0.8672773452871224

0.8084449635303086

0.9351029457194217

0.8945897250268925

0.9494078625618803

0.9209403944418232

0.8909621244315629

0.9093642449248751

0.8805756728277513

0.9078937004016162

0.9026815599658262

0.9055396430652389

0.8805420585273362

0.8895302127178055

0.864430891946831

0.9510945712751617

0.9337788404721722

0.8807353207033021

0.9082070606261123

0.9171821877923519

0.889404942086642

0.9193439317403234

0.919097621668604

0.9419420681417727

0.8811954204817696

0.9251589078415673

0.9412714743484066

0.9437815910191111

0.9191383461386362

0.9140808986129451

0.8821167439008323

0.9266437865776223

0.9613349330388862

0.9034157659992488

0.8927089599625249

0.9160741037695691

0.8980449520329881

0.8059557985638224

0.8916309839572638

0.8719933321581134

0.953045748340936

0.9420171150731507

0.9139914598870359

0.9208384990939166

0.9100335391394333

0.8967896842324133

0.884015378281273

0.8841512120446263

0.9075149563741445

0.9098205946018862

0.8944026503255416

0.8862975319624431

0.8595147597025966

0.9268982486269051

0.9151228499188248

0.8967564891964735

0.8614490941523469

0.8929738812303221

0.9259859570745994

0.9262115318798215

0.9043240326546378

0.8968797121023893

0.8500290170397272

0.9067589738647565

0.912862716523668

0.8917866885825408

0.899412816906439

0.8950340967995641

0.9133978043025924

0.9624221985414694

0.9180301375233065

0.8891963218967879

0.9098160649213141

0.8570507985407839

0.9475885843105113

0.9279091801104932

0.8417692532113992

0.9391202384291212

0.9006413192796736

0.9326566504413638

0.9444279573043229

0.912736514204652

0.927016005921805

0.9457563694442433

0.9241429125085941

0.8940357488299816

0.9024907838655172

0.9125726395583229

0.8935696678118665

0.9005538538071598

0.9369586674762561

0.9585090779089094

0.9193356150064449

0.862571182002559

0.9430535545135884

0.9189924372722647

0.8242226954759816

0.9053209084527838

0.9086455060287877

0.9339567877263563

0.9136049800749968

0.8940129992421679

0.889418485101142

0.9027791043140908

0.8931769342939471

0.931954327377515

0.8747845530198427

0.9302477000440742

0.9118004479411171

0.9258126063398807

0.9437712289282278

0.9003203540256749

0.9386364713678818

0.8795781463028408

0.9246792638168959

0.9293943771865385

0.9393967529194753

0.9298456441237931

0.9031146319217918

0.9104193581330902

0.9248024025417925

0.8588120072181775

0.9260904536934786

0.942556821812865

0.9122438233080915

0.9198054493924211

0.9074707536367848

0.893749868123292

0.9267417845604063

0.9116054266793068

0.9300340590595721

0.9148439770743331

0.8243815247079747

0.9089118492768307

0.8942321870616012

0.8845677534366503

0.9424023063097736

0.8811986611845801

0.8905465005109829

0.9302469996130053

0.9276136003550617

0.7981861965264829

0.9219285519955113

0.8778208721649864

0.8641686475555872

0.8976711496555663

0.8852809830711255

0.8851914852095396

0.913638465732685

0.8965693223771146

0.9346838918974812

0.9449573037316849

0.9408110041688208

0.8932365757637968

0.8870573702853306

0.931895327164641

0.9221223658712872

0.8723285904047196

0.9094732656519928

0.9119550567319136

0.9022096057346334

0.9323566202971011

0.8759488653670695

0.9342906123936823

0.9290276886598927

0.9026520852902303

0.8844972973419257

0.9517085669740274

0.9139974845149723

0.9079391845181503

0.9123643141690713

0.9050312998522124

0.878532469809847

0.9348691712738186

0.9275480258945932

0.9383640759608829

0.9292110797501765

0.921345657665922

0.913309031226854

0.9300891199302112

0.9392237523556309

0.9174221869849178

0.9122199934947914

0.9206654499270737

0.930750320062985

0.9152205415743122

0.9328173146352351

0.925719166174635

0.9252851468459427

0.9225778911714453

0.8693236037881008

0.9202867642059989

0.9115329823808684

0.9368704384501879

0.908154566947363

0.8353865286027147

0.9082051324683492

0.8967329044946769

0.8977743874670572

0.9158785605085513

0.8900963494665397

0.8786281793609482

0.8815231068630481

0.8578190836524352

0.9132052449108977

0.922442330326847

0.8534310292509149

0.8891636277632479

0.9235830921429222

0.8904376572594513

0.938720927759614

0.9386450318840587

0.9153576883286708

0.9163445970543052

0.9503788743171693

0.9124094790087316

0.9035592608832901

0.9122037328916157

0.9137833020812229

0.8831106171467066

0.8855185556413251

0.8804280667109581

0.895264279276394

0.8934532713003108

0.8821015586591079

0.9088532942876713

0.8897685689151598

0.899698398491733

0.9161957014446238

0.9095354450872407

0.9177732798258661

0.919254632791735

0.890776563700045

0.8980544966337306

0.9207237137025839

0.9077584948500246

0.9231532256644593

0.9389064264598525

0.8563954385956164

0.9022587780911377

0.9296949252574139

0.9473174797641568

0.8793127408017132

0.9137763799446922

0.8653628601722777

0.8220204404436905

0.9213345611624221

0.9136346900074778

0.9080709018178363

0.943921712959073

0.9526137684831946

0.8975852216983593

0.8923966908396009

0.9208845088103064

0.8831115281607929

0.8605982262455265

0.9088162573986355

0.8819563017399711

0.8803372993984446

0.9353271737476455

0.8735217766449777

0.9577874682008948

0.8941550989469194

0.9462748646119374

0.9450669429474294

0.8587035528551001

0.8628487451700381

0.9019158884977599

0.9342091914171113

0.8973580396183324

0.8677755310191199

0.9179068320451118

0.9337542647638271

0.8741648185977388

0.8566252791955365

0.9471727529586325

0.8932795829878437

0.921490285757445

0.9227459127060349

0.9166780527410511

0.8893084612489428

0.9372084341985848

0.8221671906000076

0.8763010708089538

0.9182417102201462

0.9391077533633594

0.895924052207014

0.8824279192755842

0.9076669205465864

0.9187723527533481

0.9265116960732489

0.9034351696977017

0.9073515451997214

0.9423471358295951

0.9223941905321336

0.9336513263349298

0.9183277307825205

0.8810300096895312

0.9274784856728553

0.9139683141822517

0.9142633837032302

0.910463876281406

0.907149977655847

0.8783780707387309

0.9300897644153466

0.9390092817495016

0.9047980314138561

0.8969448254482001

0.9052731893488114

0.9444655548465045

0.8997069754138548

0.9543125721782437

0.9185043948208766

0.9362522373201422

0.9200134089982911

0.845755590505891

0.9020170509309459

0.881182324179235

0.8889757904908185

0.9236135047508979

0.9244531727244355

0.8735201826881771

0.8487436138609348

0.9362390251729342

0.8874080127642332

0.9003250774161856

0.9252479392502483

0.9073807291001336

0.9096168840369394

0.9036659911551808

0.92268871439879

0.9178805414531369

0.9014182826841524

0.9171482497586143

0.9232692164139161

0.9116812137595662

0.9220822652353368

0.9104118007165062

0.9000934315733937

0.9376695665994434

0.9001577296700333

0.9405428318877138

0.9024861883520647

0.8875500903632892

0.9408956651627133

0.9427754938505744

0.9112405738357859

0.9254228451103462

0.9096569051615679

0.8707339299858303

0.9132795539326962

0.9050879390549231

0.8952361350915133

0.9484995636095095

0.8836635836702393

0.8939574978532775

0.9097353008163239

0.9323162387168112

0.9269666883343216

0.9160796085925114

0.9004200390611217

0.8807801541907964

0.9234335459688783

0.9229017878283527

0.9257834873951649

0.9385139487598052

0.9114225975908832

0.9529389666402105

0.9316096891564859

0.9254222500242295

0.9103072860606034

0.8885131332736681

0.9206001538761766

0.9233582673065445

0.8793860610138905

0.9104299765363929

0.9343171921297828

0.8227947941609978

0.8859763955479962

0.9277147580972497

0.9312900490285693

0.8910571194655195

0.9292953418884681

0.8861864482443232

0.9259220285917796

0.9091088654065483

0.9184815572090809

0.8477407002424878

0.9220212018406299

0.9515398356390472

0.9329866503628212

0.9064966002776871

0.8992339425483954

0.9296150963028172

0.8892770957074458

0.9518304307326654

0.9236606992993204

0.8952957741717372

0.9130773516986759

0.9376599464348628

0.9405725610741336

0.9131082073901503

0.9127573595496155

0.9013322360795839

0.8643392345852452

0.8794038149502548

0.920355014644264

0.8765763186195348

0.9180892972877024

0.855298821705496

0.9322770068254399

0.9119054392834847

0.9355422973064443

0.9355556315976838

0.9273449888963096

0.951379930924827

0.9275219337698514

0.9190065763211531

0.8839504107059133

0.9068082940736923

0.8908498428164267

0.9267594805819703

0.8499012666791128

0.8385714691098196

0.8919801293307406

0.9264158671537865

0.9128957272904411

0.8751391004175296

0.9406339705285203

0.8510468302355685

0.8657125415414197

0.9166732572665587

0.8517043154035797

0.9273581344026636

0.8657038748806309

0.8986041742752282

0.883024353646184

0.9035631625358634

0.9530002434292213

0.9031126456353523

0.9110503708935199

0.9072845371507745

0.9120270179066998

0.8697499157810884

0.8626603215019536

0.8588616385030716

0.9224861194040445

0.8928601794127932

0.9014213761184849

0.9212196385257834

0.8536254236693572

0.9280759538032858

0.8815365078921255

0.914554756606091

0.8855415723363391

0.9278034655761067

0.9202365658140994

0.9158656300404955

0.8582524630227351

0.9393495919597163

0.9156618382882543

0.9195919362546877

0.9308655052765936

0.9094343442484105

0.9176071028282875

0.8829832035622247

0.936333775512217

0.8932851167853425

0.9074149298437254

0.8872638498309098

0.8679144604020094

0.8364881796047693

0.9172311619628595

0.9082350060248714

0.9245327310224836

0.8365991524882419

0.9167657987575091

0.9057916939676849

0.9257927326043119

0.9264442206951149

0.8687008748203344

0.8982094449064169

0.8816991533299683

0.8834620322016855

0.9138214334412436

0.923361650316313

0.900972418499519

0.91290685607559

0.9174260806162635

0.9437733870581222

0.8907506015561015

0.8954781446242621

0.937879366943712

0.8694448371756351

0.9420106174493583

0.9267764105195501

0.9080551340095783

0.9026454338623

0.9181508905103334

0.9442945502913442

0.9160938795737222

0.9197572835165595

0.9090517795166871

0.8008624896921648

0.9219666424921522

0.92528372167099

0.9142741221469938

0.8683375851489624

0.8976989875875919

0.926834998073197

0.8879172841904829

0.903723567177978

0.9180953279890404

0.9349675749777885

0.9433075958969368

0.9076866339078462

0.8382965276304803

0.8400194043452442

0.930348960449167

0.8871108448694794

0.8757471817004522

0.8916702354331776

0.876477836792114

0.906229210664218

0.8859103090924114

0.9023489720955935

0.9213914217247866

0.8826405799164111

0.9224147721671792

0.9121105005603712

0.9345238482553536

0.9019218554543756

0.9228285317861517

0.9268726616274132

0.9293538273601338

0.937970581571983

0.9214766877773258

0.936897959193402

0.8675241325762079

0.9293306991404973

0.9228428386051887

0.8987598227649578

0.8955007239168432

0.8845765363717888

0.924572354322034

0.9348255363557557

0.9297389324500246

0.941507806377776

0.9176051055270432

0.9102010572926716

0.9456983853182085

0.9070277529335088

0.9200539050168183

0.9043049220285404

0.9339530985976177

0.9463036956797874

0.9391141525040947

0.8910275751617401

0.9212989176242935

0.9252102120617751

0.9193801159158281

0.9389127352043956

0.9303461507234866

0.9127382489002317

0.9102181467736944

0.9099673638857851

0.903261975245606

0.8633453641117081

0.946110344851151

0.9502393839218902

0.9003187097630998

0.885053530587303

0.9145683464690181

0.9120159306442752

0.9164400194917983

0.8473748216193855

0.9222432901276852

0.8647299907948056

0.8536300892121818

0.9076252017336641

0.9027390647978613

0.8801612044566861

0.8457533536165248

0.9035929574736457

0.8960217900447661

0.9364492257205981

0.9307489098747064

0.8698386821778894

0.9132943040221988

0.9258266043501676

0.9178192903673311

0.9120151532341021

0.9225263742942339

0.9036965016944927

0.9290136069700712

0.8890024161454957

0.9212254615045884

0.9024017822190935

0.8940974216210702

0.9535795234497081

0.8798477332340968

0.9269167726016098

0.9241455662964664

0.9225804515917091

0.9021451639304612

0.9263899463156132

0.9487049225505116

0.9064295707742973

0.9476219343815696

0.8863858299753843

0.9321834775029767

0.8912830493579037

0.8787527496024125

0.9328194235238181

0.924947016359893

0.9454038312655058

0.9037807163303833

0.9193918166202492

0.915031464503171

0.9037669323332964

0.9336936208427769

0.9293229647597236

0.9076555305935964

0.9083465320146882

0.8962723411154503

0.8973017110838946

0.9409741725586668

0.9006786710034802

0.8902443136871443

0.9182297620657344

0.9328370440184811

0.9105700528766161

0.9213263615641282

0.9458742291473149

0.9104254783965491

0.9167575841461749

0.928123318914952

0.9265789061678786

0.9280719979412795

0.8628459922006594

0.9041330939726788

0.9285543303687199

0.9327754620409181

0.909690039422563

0.9079438074129673

0.9353885438647809

0.9307830572036521

0.9182778839975348

0.9128263212323503

0.899953777913759

0.9228467117001209

0.9148409286523208

0.8828575230688661

0.8909404274627515

0.9032803077568731

0.9120785112531029

0.9021064986507337

0.9182416819173628

0.8752269742866698

0.9070808204230643

0.8765405113067539

0.9158750147355017

0.9191724573372139

0.9634572789002321

0.847103407772931

0.866411934963424

0.9161582623226969

0.9097053507451885

0.9448671812481527

0.9031499413927988

0.8477725257001039

0.9379812771372781

0.9111159532096241

0.9034754666656366

0.8795515373193993

0.8674267760290011

0.8894561751579709

0.9211757431472593

0.8803839178626083

0.904965179212534

0.9083353069522139

0.9169289224804446

0.916190242366179

0.9047548038671039

0.9239257744753006

0.9330363564893474
In [957]:
# plot scores
pyplot.hist(stats)
pyplot.show()

# confidence interval
alpha = 0.95   # for 95% confidence
p = ((1.0-alpha)/2.0)*100  # tail regions on right and left 0.25 on each side indicated by p value (border)
lower = max(0.0, np.percentile(stats,p))
p = (alpha + ((1.0-alpha)/2.0))*100
upper = min(1.0, np.percentile(stats,p))
print('%0.1f confidence interval %.1f%% and %.1f%%' %(alpha*100, lower*100, upper*100))
95.0 confidence interval 84.7% and 95.0%

9. Conclusion

Basis of selection of algorithms which are suitbale for this project......

In [958]:
result_itteration1
Out[958]:
Algorithm Model_score Root Mean Square Error Cross_val_score Std_Dev
0 Linear Regression 69.965042 0.506141 59.191047 7.307688
1 Ridge 69.958537 0.506196 59.192380 7.300019
2 Lasso 37.632265 0.729353 33.833772 5.797047
3 Deision Tree 58.310412 0.659683 60.047945 6.644179
4 Support Vector 82.944132 0.421947 83.897852 2.909932
5 Random Forest 90.357089 0.317267 91.578826 2.396392
6 XGBoost 92.195915 0.285419 93.556535 2.498270
7 GradientBoost 86.618808 0.373739 87.411868 2.211392
In [959]:
fig = plt.figure(figsize = (22,5))



plt.title ('RMSE values for various Algorithm',y=1, size = 22, color = 'red')

sns.barplot(y = result_itteration1['Root Mean Square Error'], x = result_itteration1['Algorithm'], facecolor = (0.5,0.5,0.5,0.8), linewidth = 10, edgecolor = sns.color_palette ('dark', 12) );

plt.ylabel('RMSE', size = 20)
plt.xlabel('Algorithm', size = 20)
plt.tight_layout()
In [960]:
result_itteration2
Out[960]:
Algorithm Model_score Root Mean Square Error Cross_val_score Std_Dev
0 Linear Regression 60.347237 0.670540 55.562938 6.458897
1 Ridge 60.336035 0.670653 55.563602 6.456769
2 Lasso 30.019455 0.948238 29.427266 7.810435
3 Deision Tree 52.918005 0.688565 51.525774 8.090367
4 Support Vector 75.984861 0.518199 74.430066 5.612207
5 Random Forest 82.793223 0.414309 83.873459 3.415621
6 XGBoost 81.663114 0.411717 83.776223 3.757386
7 GradientBoost 79.034399 0.466321 79.741688 3.312904
In [961]:
fig = plt.figure(figsize = (22,5))



plt.title ('RMSE values for various Algorithm',y=1, size = 22, color = 'red')

sns.barplot(y = result_itteration2['Root Mean Square Error'], x = result_itteration2['Algorithm'], facecolor = (0.5,0.5,0.5,0.8), linewidth = 10, edgecolor = sns.color_palette ('dark', 12) );

plt.ylabel('RMSE', size = 20)
plt.xlabel('Algorithm', size = 20)
plt.tight_layout()
In [962]:
result_itteration3
Out[962]:
Algorithm Model_score Root Mean Square Error Cross_val_score Std_Dev
0 Linear Regression 56.613860 0.698418 52.267753 7.481897
1 Ridge 56.608356 0.698479 52.269813 7.478485
2 Lasso 25.878016 0.944048 24.473386 7.239213
3 Deision Tree 54.498610 0.704209 56.634475 4.983943
4 Support Vector 72.140254 0.553305 73.290340 7.813029
5 Random Forest 83.620314 0.465409 84.352936 3.886086
6 XGBoost 82.231836 0.470004 83.526744 4.072049
7 GradientBoost 79.507972 0.495998 79.433124 3.367108
In [963]:
fig = plt.figure(figsize = (22,5))



plt.title ('RMSE values for various Algorithm',y=1, size = 22, color = 'red')

sns.barplot(y = result_itteration3['Root Mean Square Error'], x = result_itteration3['Algorithm'], facecolor = (0.5,0.5,0.5,0.8), linewidth = 10, edgecolor = sns.color_palette ('dark', 12) );

plt.ylabel('RMSE', size = 20)
plt.xlabel('Algorithm', size = 20)
plt.tight_layout()

Observations:

  • From the above iteration, we have taken the models which have least RMSE value and highest model score.

  • For linear repression, cost function is the RMSE, which is to be minimized.

  • So the models like Random Forest and XGBoost Repressor are the best models which have least RMSE value and maximum model score.

  • Based upon this criteria, we can take these two models for further tuning and to squeeze the extra performance out of the model without making those under fit or Over fit.

In [964]:
# plot scores
pyplot.hist(stats)
pyplot.show()

# confidence interval
alpha = 0.95   # for 95% confidence
p = ((1.0-alpha)/2.0)*100  # tail regions on right and left 0.25 on each side indicated by p value (border)
lower = max(0.0, np.percentile(stats,p))
p = (alpha + ((1.0-alpha)/2.0))*100
upper = min(1.0, np.percentile(stats,p))
print('%0.1f confidence interval %.1f%% and %.1f%%' %(alpha*100, lower*100, upper*100))
95.0 confidence interval 84.7% and 95.0%

Model Performance range at 95% confidence level

Since the model XGBoost Regressor has provided the best result by minimizing the cost function and maximizing the model score, we will go with it as a suitable model.

And after tuning the model by employing techniques to squeeze the extra performance out of the model without making it overfit or underfit we achieved a model score of 86.05% with a standard deviation of 6.85 at 95 % of confidence level. So the range estimate of my model score is in between 79.2% and 92.9%.

This range estimate is at 95% confidence level. That means with 95% confidence level we can say our model score will range between 79.2% and 92.9% in production.

Almost the same amount of model score (85.68%) we are also getting out of XGBoost regressor algorithm with a standard deviation of 3.371% while using Randomized Search CV Hyper Parameter model tuning technique.

Summary:

  • In this data set, we have tried to predict the strength of the Concrete based on the proportional quantity of some ingredients and age or incubation time period required to induce the compressive strength in concrete.
  • The data set has very less challenging as far as data cleaning is concerned. It was almost a cleaned dataset except the presence of some duplicate data points.
  • From EDA part, we tried to draw the relation between target feature and independent features.
  • We have also checked the range, IQR, Skewness and presence of outliers.
  • Outliers were being addressed by using median value of each attributes, which in turn generated some more outliers due to sharpening of the distribution and decrease in Standard Deviation.
  • We have tried to build clusters by using K-means clustering algorithm from unsupervised machine learning technique.
  • Feature importance were also been checked for all the features by using VIF value and Decision Tree, Random Forest, XGBoost & Gradient Boost regression techniques.
  • Two Composite features were also been created to predict the target column.
  • For this data set we made total 8 different models to predict the target columns such as Non-residual linear Model, Residual Linear models like Ridge & Lasso, Decision Tree Regression Model, Random Forest Regressor model, Support Vector Regressor model, XGBoost Regressor Model, Gradient Boost Regressor Model.
  • Based upon the residual plots, Ridge and Lasso methods, complexity of the model has also been checked. The Lasso model acted like a Dimensionality reduction technique to reduce the dimension of the data set and predict the target column. Based on the inference drawn, simple linear models were created to predict the target column.
  • Polynomial features were also been created to check the prediction capability of None regularised and regularised models.
  • K-fold Cross Validation technique has also been implemented to predict the target column.
  • At least based upon the Cost function which is Root Mean Square Error (RMSE) for regression technique and model score, we selected two models for tuning by further employing feature engineering techniques such as Grid Search CV and Random Search CV without making those overfit and underfit.
  • And out of these iteration we elected XGBoost model over Random Forest Regressor to go with and thus, we calculated the Model performance range at 95% confidence level.
  • From the above iteration, we have taken the models which have least RMSE value and highest model score.
  • For linear repression, cost function is the RMSE, which is to be minimized.
  • So the models like Random Forest and XGBoost Repressor are the best models which have least RMSE value and maximum model score.
  • Based upon this criteria, we can take these two models for further tuning and to squeeze the extra performance out of the model without making those under fit or Over fit.

THE END